(re-sending, first message seems to have gotten lost)

I was referred here by Ian Campbell ijc@hellion.org.uk from bugs.debian.org.

First, I'm happy to provide more information about this bug as requested. I recognize not all relevant data has been collected yet.

Detailed information about this bug can be found at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124.

The executive summary is: Using Debian Testing (7.0, wheezy) dom0 with LVM and full disk encryption with Debian Stable (6.0, Squeeze) domU, transferring large files via scp or rsync over openswan results in data corruption, with eventual file system corruption. The culprit appears to be full disk encryption, however that evidence may not be conclusive.

While I don't mind providing additional information, I'd hate to have to repeat the information I've provided to the Debian bug hunting folks.

Thanks in advance for any help you can provide.
On Tue, 2013-04-16 at 18:39 +0100, Anthony Sheetz wrote:
> (re-sending, first message seems to have gotten lost)
>
> I was referred here by Ian Campbell ijc@hellion.org.uk from bugs.debian.org.

I'm here too (different hat ;-)), thanks for posting it here. I've added some people who know about the block stuff to the CC.

Guys, my suspicion is that barriers issued by ext3 inside the guest aren't making it all the way down the ext3->blkfront->blkback->lvm->dm-crypt->disk chain, leading the filesystem to eventually corrupt itself.

The issue seems to relate to the use of dm-crypt, since ext3->blkfront->blkback->lvm->disk is reported to work fine.

However, there is no problem with the local dom0 ext3 root filesystem, which is also in the same lvm VG on the crypt device (i.e. ext3->lvm->dm-crypt->disk), so it's not purely a dm-crypt issue. I figure something is up at the blkfront->blkback link, such that the barriers which blkback is injecting into the block subsystem either don't make it to the dm-crypt layer or do not DTRT once they arrive.

I'm not really sure how to proceed (or how to ask Anthony to proceed) with verifying any part of that hypothesis though.

ISTR issues with old vs new style barriers or barriers with no data in them or something, could this be related to that? (or am I thinking of DISCARD?)

The issue was initially reported with Squeeze (Jeremy 2.6.32 tree) domU on a Wheezy (mainline 3.2) dom0, but IIRC has also been repeated with Wheezy on Wheezy now, so this isn't cross version confusion about barrier semantics AFAICT.

Ian.
I realize folks are pretty busy, but we're still interested in getting this problem solved, and I want to be sure it's not lost in the shuffle.

Any chance of getting some attention for it?

On Wed, Apr 17, 2013 at 9:00 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> Guys, my suspicion is that the issue is that barriers issued by ext3
> inside the guest aren't making it all the way down the
> ext3->blkfront->blkback->lvm->dm-crypt->disk chain leading the
> filesystem to eventually corrupt itself.
Konrad is on vacation this week, so it'll probably be next week before this gets looked at by him.

Ian.

On Mon, 2013-04-22 at 13:22 +0100, Anthony Sheetz wrote:
> I realize folks are pretty busy, but we're still interested in getting
> this problem solved, and I want to be sure it's not lost in the
> shuffle.
> Any chance of getting some attention for it?
I would once again like to request help with a bug in Xen. Repeating message from April 16th:

First, I'm happy to provide more information about this bug as requested. I recognize not all relevant data has been collected yet.

Detailed information about this bug can be found at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124.

The executive summary is: Using Debian Testing (7.0, wheezy) dom0 with LVM and full disk encryption with Debian Stable (6.0, Squeeze) domU, transferring large files via scp or rsync over openswan results in data corruption, with eventual file system corruption. The culprit appears to be full disk encryption, however that evidence may not be conclusive.

While I don't mind providing additional information, I'd hate to have to repeat the information I've provided to the Debian bug hunting folks.

Thanks in advance for any help you can provide.
On Mon, Apr 22, 2013 at 01:26:34PM +0100, Ian Campbell wrote:
> Konrad is on vacation this week, so it'll probably be next week before
> this gets looked at by him.

And I finally got to this email in my 'vacation-mbox'.

> > > ISTR issues with old vs new style barriers or barriers with no data in
> > > them or something, could this be related to that? (or am I thinking of
> > > DISCARD?)

You are using two different kernel versions. The 2.6.32 domU is only using WRITE_BARRIERs, while in the 3.2 kernels those have been completely eliminated. The mechanism they use is called 'WRITE_FLUSH'. The 3.2 kernel has a patch:

commit 29bde093787f3bdf7b9b4270ada6be7c8076e36b
Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date:   Mon Oct 10 00:42:22 2011 -0400

    xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.

which emulates the barrier request by draining all of the outstanding I/Os and then sending the WRITE_FLUSH.

But it looks like you are hitting an issue here. Just to make sure that is the case, what happens if you use the _same_ kernel in both dom0 and domU? Does it work then?
On Wed, May 22, 2013 at 4:10 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> But it looks like you are hitting an issue here. Just to make sure
> that is the case, what happens if you use the _same_ kernel in both dom0 and
> domU? Does it work then?

First, thank you so much for getting back to me, it's really appreciated.

At this point I've forgotten if I did this with Wheezy on Wheezy, and what the result was. I'll have to test using the 3.2 kernel on the domU Debian Squeeze and get back to you. I should be able to do that early next week.
On Thu, May 23, 2013 at 02:19:50PM -0400, Anthony Sheetz wrote:
> First, thank you so much for getting back to me, it's really appreciated.
> At this point I've forgotten if I did this with Wheezy on Wheezy, and
> what the result was.
> I'll have to test using the 3.2 kernel on the domU Debian Squeeze and
> get back to you. I should be able to do that early next week.

Thank you. Also when you do this test, could you also provide the 'xenstore-ls' output from dom0? And the 'dmesg' output from the guest (or at least the output of 'xl console <guest> | tee /tmp/log')? That would give me an idea of whether the frontend/backend have the right negotiation parameters.

Have a good weekend!
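As a rough sketch of what to look for in that output: the block backend advertises its cache-control support through the xenstore keys 'feature-barrier' and 'feature-flush-cache' under the backend vbd entries. The helper below simply filters a saved 'xenstore-ls' dump for those two keys. The function name filter_blk_features is hypothetical (it is not part of any Xen tool), and it assumes the dump is plain text as printed by 'xenstore-ls'.

```shell
# Hypothetical helper, not part of the Xen tools: reads `xenstore-ls` output
# on stdin and prints only the lines for the two block cache-control keys.
# `|| true` keeps the exit status 0 when nothing matches.
filter_blk_features() {
    grep -E 'feature-(barrier|flush-cache)' || true
}
```

It could be used from dom0 as 'xenstore-ls > /tmp/xenstore.txt' followed by 'filter_blk_features < /tmp/xenstore.txt' (the path is only an example). An empty result would suggest the backend is not advertising either feature for the guest's disks.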
On 17/04/13 15:00, Ian Campbell wrote:
> The issue was initially reported with Squeeze (Jeremy 2.6.32 tree) domU
> on a Wheezy (mainline 3.2) dom0 but IIRC has also been repeated with
> Wheezy on Wheezy now so this isn't cross version confusion about barrier
> semantics AFAICT.

Hello,

I've been trying to reproduce this issue, but so far I haven't been able to. I guess I'm missing something, so here are the steps I followed:

First, I created a primary partition on my HDD (sda3), and then executed the following in order to encrypt it and set up LVM:

# cryptsetup luksFormat /dev/sda3
# cryptsetup luksOpen /dev/sda3 crypt
# pvcreate /dev/mapper/crypt
# vgcreate crypt /dev/mapper/crypt
# lvcreate -L 20G crypt -n debian

That gives me a block device, /dev/crypt/debian, which I'm attaching to a Debian DomU as xvdb. I created a partition filling the whole disk and formatted it inside the guest using mkfs.ext3.

Then, inside the guest, I scp'ed a 10G file from a remote host and checked the checksum; everything was OK. So far I've tested with a Dom0 kernel 3.2.0-0.bpo.4-amd64 and DomU kernels 3.2.0-0.bpo.4-amd64 and 2.6.32-5-xen-amd64; both tests were OK.

Regards, Roger.
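The checksum step in that test can be sketched as follows. This is only an illustration under assumptions: verify_md5 is a hypothetical helper name, md5sum (coreutils) is assumed to be available in the guest, and the expected sum is assumed to have been computed on the source host before the transfer.

```shell
# Hypothetical helper: compare a file's MD5 against the sum recorded on the
# source host. Prints the result and returns 0 on match, 1 on mismatch.
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<hash>  <filename>"; keep only the hash field.
    actual=$(md5sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)"
        return 1
    fi
}
```

For example, run 'md5sum bigfile' on the remote host, copy the file with scp, then call 'verify_md5 bigfile <sum-from-remote-host>' inside the guest. Repeating the transfer several times may help, since the corruption is reported with large transfers.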
Missed a reply-all...

I would guess the difference is I am using LVM with full disk encryption. Take a look at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124 for the details on exactly how I am able to recreate this bug.

In other words, I use the installer and choose the option to use full disk encryption and LVM.

I'll be starting shortly on the rest of the testing and data collection that was requested.

On Fri, May 24, 2013 at 1:48 PM, Roger Pau Monné <roger.pau@citrix.com> wrote:
> I've been trying to reproduce this issue, but so far I haven't been able
> to. I guess I'm missing something, so here are the steps I followed:
On 28/05/13 14:10, Anthony Sheetz wrote:
> I would guess the difference is I am using LVM with full disk
> encryption. Take a look at
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124 for the
> details on exactly how I am able to recreate this bug.
> In other words, I use the installer and chose the option to use full
> disk encryption and LVM.

I would like to avoid reinstalling my whole OS, and I don't have a spare HDD, so is there any way I can reproduce the full disk encryption setup using a partition?
> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
> output from dom0? And the 'dmesg' output from the guest (or at least
> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
> the frontend/backend have the right negotiation parameters.

Attached is the output of xenstore-ls from dom0, and dmesg from a domU
with kernel 2.6.32-5-xen-amd64.
Will be working on putting a 3.2 kernel in place next, testing file
transfer, and adding the output of dmesg from that.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
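For anyone skimming a long xenstore-ls dump for the negotiation parameters mentioned above, a small filter along these lines may help. This is a sketch under assumptions: blk_features is a made-up helper name, and it only looks for the 'feature-barrier' and 'feature-flush-cache' keys, which is where blkback is understood to advertise barrier and flush support. Other keys may matter too.

```shell
#!/bin/sh
# Hypothetical helper: keep only the block-device feature keys from
# xenstore-ls output supplied on stdin.
# The key names are an assumption about what blkback advertises.
blk_features() {
    grep -E 'feature-(barrier|flush-cache)'
}
```

On dom0 this would be run as: xenstore-ls | blk_features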
On Tue, May 28, 2013 at 10:27 AM, Anthony Sheetz <sheetzam@inspire.com> wrote:
>> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
>> output from dom0? And the 'dmesg' output from the guest (or at least
>> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
>> the frontend/backend have the right negotiation parameters.
>
> Attached is the output of xenstore-ls from dom0, and dmesg from a domU
> with kernel 2.6.32-5-xen-amd64
> Will be working on putting a 3.2 kernel in place next, testing file
> transfer, and adding the output of dmesg from that.

Updated to 3.2 using
http://www.cyberciti.biz/faq/debian-linux-6-apt-get-install-linux-kernel-3-2/
for instructions.
During transfer of data saw this:
BUG: scheduling while atomic: kworker/0:2/10421/0x10000002
Transfer test resulted in a file which did not match md5sum. Attached
is the dmesg output from the domU.
>> I would guess the difference is I am using LVM with full disk
>> encryption. Take a look at
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124 for the
>> details on exactly how I am able to recreate this bug.
>> In other words, I use the installer and chose the option to use full
>> disk encryption and LVM.
>> I'll be starting with the rest of the testing and data collection
>> which was requested shortly.
>
> I would like to avoid reinstalling my whole OS, and I don't have a spare
> HDD, so isn't there any way I can reproduce the full disk encryption
> using a partition?

As my colleague points out, the setup you have misses that a single
encrypted object is in use by both dom0 and domU. Without having your
dom0 on the same encrypted device as your domU (even though they use
different logical volumes) I'm not sure how to test it.
On Tue, May 28, 2013 at 02:02:41PM -0400, Anthony Sheetz wrote:
> On Tue, May 28, 2013 at 10:27 AM, Anthony Sheetz <sheetzam@inspire.com> wrote:
> >> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
> >> output from dom0? And the 'dmesg' output from the guest (or at least
> >> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
> >> the frontend/backend have the right negotiation parameters.
> >
> > Attached is the output of xenstore-ls from dom0, and dmesg from a domU
> > with kernel 2.6.32-5-xen-amd64
> > Will be working on putting a 3.2 kernel in place next, testing file
> > transfer, and adding the output of dmesg from that.
>
> updated to 3.2 using
> http://www.cyberciti.biz/faq/debian-linux-6-apt-get-install-linux-kernel-3-2/
> for instructions.
> During transfer of data saw this: BUG: scheduling while atomic:
> kworker/0:2/10421/0x10000002

? I don't see it here?

> Transfer test resulted in a file which did not match md5sum. Attached
> is the dmesg output from the domU.

Shouldn't the BUG be present here?

> [ 0.000000] Initializing cgroup subsys cpuset
> [ 0.000000] Initializing cgroup subsys cpu
> [ 0.000000] Linux version 3.2.0-0.bpo.4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Debian 3.2.41-2+deb7u2~bpo60+1
> [ 0.000000] Command line: root=/dev/xvda2 ro
> [ 0.000000] ACPI in unprivileged domain disabled
> [ 0.000000] Released 0 pages of unused memory
> [ 0.000000] Set 0 page(s) to 1-1 mapping
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
> [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
> [ 0.000000] Xen: 0000000000100000 - 0000000060800000 (usable)
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] DMI not present or invalid.
> [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
> [ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
> [ 0.000000] No AGP bridge found
> [ 0.000000] last_pfn = 0x60800 max_arch_pfn = 0x400000000
> [ 0.000000] initial memory mapped : 0 - 03639000
> [ 0.000000] Base memory trampoline at [ffff88000009b000] 9b000 size 20480
> [ 0.000000] init_memory_mapping: 0000000000000000-0000000060800000
> [ 0.000000] 0000000000 - 0060800000 page 4k
> [ 0.000000] kernel direct mapping tables up to 60800000 @ cf9000-1000000
> [ 0.000000] xen: setting RW the range fdc000 - 1000000
> [ 0.000000] RAMDISK: 01949000 - 03639000
> [ 0.000000] NUMA turned off
> [ 0.000000] Faking a node at 0000000000000000-0000000060800000
> [ 0.000000] Initmem setup node 0 0000000000000000-0000000060800000
> [ 0.000000] NODE_DATA [000000005fffb000 - 000000005fffffff]
> [ 0.000000] Zone PFN ranges:
> [ 0.000000] DMA 0x00000010 -> 0x00001000
> [ 0.000000] DMA32 0x00001000 -> 0x00100000
> [ 0.000000] Normal empty
> [ 0.000000] Movable zone start PFN for each node
> [ 0.000000] early_node_map[2] active PFN ranges
> [ 0.000000] 0: 0x00000010 -> 0x000000a0
> [ 0.000000] 0: 0x00000100 -> 0x00060800
> [ 0.000000] On node 0 totalpages: 395152
> [ 0.000000] DMA zone: 56 pages used for memmap
> [ 0.000000] DMA zone: 744 pages reserved
> [ 0.000000] DMA zone: 3184 pages, LIFO batch:0
> [ 0.000000] DMA32 zone: 5348 pages used for memmap
> [ 0.000000] DMA32 zone: 385820 pages, LIFO batch:31
> [ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
> [ 0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
> [ 0.000000] No local APIC present
> [ 0.000000] APIC: disable apic facility
> [ 0.000000] APIC: switched to apic NOOP
> [ 0.000000] nr_irqs_gsi: 16
> [ 0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
> [ 0.000000] Allocating PCI resources starting at 60800000 (gap: 60800000:9f800000)
> [ 0.000000] Booting paravirtualized kernel on Xen
> [ 0.000000] Xen version: 4.1.4 (preserve-AD)
> [ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:1 nr_node_ids:1
> [ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff88005fc00000 s82880 r8192 d23616 u2097152
> [ 0.000000] pcpu-alloc: s82880 r8192 d23616 u2097152 alloc=1*2097152
> [ 0.000000] pcpu-alloc: [0] 0
> [ 0.000000] Built 1 zonelists in Node order, mobility grouping on. Total pages: 389004
> [ 0.000000] Policy zone: DMA32
> [ 0.000000] Kernel command line: root=/dev/xvda2 ro
> [ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [ 0.000000] Checking aperture...
> [ 0.000000] No AGP bridge found
> [ 0.000000] Calgary: detecting Calgary via BIOS EBDA area
> [ 0.000000] Calgary: Unable to locate Rio Grande table in EBDA - bailing!
> [ 0.000000] Memory: 1504508k/1581056k available (3531k kernel code, 448k absent, 76100k reserved, 3208k data, 616k init)
> [ 0.000000] Hierarchical RCU implementation.
> [ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
> [ 0.000000] NR_IRQS:33024 nr_irqs:256 16
> [ 0.000000] Console: colour dummy device 80x25
> [ 0.000000] console [tty0] enabled
> [ 0.000000] console [hvc0] enabled
> [ 0.000000] Xen: using vcpuop timer interface
> [ 0.000000] installing Xen timer for CPU 0
> [ 0.000000] Detected 2294.848 MHz processor.
> [ 0.004000] Calibrating delay loop (skipped), value calculated using timer frequency.. 4589.69 BogoMIPS (lpj=9179392)
> [ 0.004000] pid_max: default: 32768 minimum: 301
> [ 0.004000] Security Framework initialized
> [ 0.004000] AppArmor: AppArmor disabled by boot time parameter
> [ 0.004000] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
> [ 0.004000] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
> [ 0.004000] Mount-cache hash table entries: 256
> [ 0.004000] Initializing cgroup subsys cpuacct
> [ 0.004000] Initializing cgroup subsys memory
> [ 0.004000] Initializing cgroup subsys devices
> [ 0.004000] Initializing cgroup subsys freezer
> [ 0.004000] Initializing cgroup subsys net_cls
> [ 0.004000] Initializing cgroup subsys blkio
> [ 0.004000] Initializing cgroup subsys perf_event
> [ 0.004000] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
> [ 0.004000] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
> [ 0.004000] CPU: Physical Processor ID: 0
> [ 0.004000] CPU: Processor Core ID: 0
> [ 0.004000] SMP alternatives: switching to UP code
> [ 0.029088] Freeing SMP alternatives: 16k freed
> [ 0.029163] Performance Events: unsupported p6 CPU model 58 no PMU driver, software events only.
> [ 0.029293] NMI watchdog disabled (cpu0): hardware events not enabled
> [ 0.029318] Brought up 1 CPUs
> [ 0.029448] devtmpfs: initialized
> [ 0.032173] Grant table initialized
> [ 0.032244] print_constraints: dummy:
> [ 0.032305] NET: Registered protocol family 16
> [ 0.032510] PCI: setting up Xen PCI frontend stub
> [ 0.032517] PCI: pci_cache_line_size set to 64 bytes
> [ 0.033015] bio: create slab <bio-0> at 0
> [ 0.033078] ACPI: Interpreter disabled.
> [ 0.033098] xen/balloon: Initialising balloon driver.
> [ 0.033098] xen-balloon: Initialising balloon driver.
> [ 0.033098] vgaarb: loaded
> [ 0.033098] PCI: System does not support PCI
> [ 0.033098] PCI: System does not support PCI
> [ 0.033098] Switching to clocksource xen
> [ 0.033194] pnp: PnP ACPI: disabled
> [ 0.034979] PCI: max bus depth: 0 pci_try_num: 1
> [ 0.035010] NET: Registered protocol family 2
> [ 0.035175] IP route cache hash table entries: 65536 (order: 7, 524288 bytes)
> [ 0.036322] TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
> [ 0.037073] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
> [ 0.037188] TCP: Hash tables configured (established 262144 bind 65536)
> [ 0.037193] TCP reno registered
> [ 0.037207] UDP hash table entries: 1024 (order: 3, 32768 bytes)
> [ 0.037225] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
> [ 0.037284] NET: Registered protocol family 1
> [ 0.037292] PCI: CLS 0 bytes, default 64
> [ 0.037327] Unpacking initramfs...
> [ 0.061808] Freeing initrd memory: 29632k freed
> [ 0.067281] platform rtc_cmos: registered platform RTC device (no PNP device found)
> [ 0.067460] audit: initializing netlink socket (disabled)
> [ 0.067471] type=2000 audit(1369752979.409:1): initialized
> [ 0.080739] HugeTLB registered 2 MB page size, pre-allocated 0 pages
> [ 0.080910] VFS: Disk quotas dquot_6.5.2
> [ 0.080931] Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> [ 0.080980] msgmni has been set to 2996
> [ 0.081099] alg: No test for stdrng (krng)
> [ 0.081120] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
> [ 0.081126] io scheduler noop registered
> [ 0.081129] io scheduler deadline registered
> [ 0.081140] io scheduler cfq registered (default)
> [ 0.081183] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> [ 0.081202] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
> [ 0.081206] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
> [ 0.197788] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [ 0.198048] Linux agpgart interface v0.103
> [ 0.198133] i8042: PNP: No PS/2 controller found. Probing ports directly.
> [ 1.200733] i8042: No controller found
> [ 1.200830] mousedev: PS/2 mouse device common for all mice
> [ 1.240666] rtc_cmos rtc_cmos: rtc core: registered rtc_cmos as rtc0
> [ 1.240721] rtc_cmos: probe of rtc_cmos failed with error -38
> [ 1.240885] TCP cubic registered
> [ 1.240933] NET: Registered protocol family 10
> [ 1.241267] Mobile IPv6
> [ 1.241274] NET: Registered protocol family 17
> [ 1.241283] Registering the dns_resolver key type
> [ 1.241388] PM: Hibernation image not present or could not be loaded.
> [ 1.241395] registered taskstats version 1
> [ 1.241410] XENBUS: Device with no driver: device/vbd/51714
> [ 1.241416] XENBUS: Device with no driver: device/vbd/51713
> [ 1.241420] XENBUS: Device with no driver: device/vif/0
> [ 1.241425] XENBUS: Device with no driver: device/console/0
> [ 1.241442] /build/buildd-linux_3.2.41-2+deb7u2~bpo60+1-amd64-mnypfK/linux-3.2.41/drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
> [ 1.241476] Initializing network drop monitor service
> [ 1.241791] Freeing unused kernel memory: 616k freed
> [ 1.241910] Write protecting the kernel read-only data: 6144k
> [ 1.244660] Freeing unused kernel memory: 548k freed
> [ 1.245050] Freeing unused kernel memory: 708k freed
> [ 1.276238] udev[45]: starting version 164
> [ 1.312147] Initialising Xen virtual ethernet driver.
> [ 1.327497] blkfront: xvda2: flush diskcache: enabled
> [ 1.331984] blkfront: xvda1: flush diskcache: enabled
> [ 1.667213] kjournald starting. Commit interval 5 seconds
> [ 1.667240] EXT3-fs (xvda2): mounted filesystem with ordered data mode
> [ 2.738037] udev[140]: starting version 164
> [ 3.172340] input: PC Speaker as /devices/platform/pcspkr/input/input0
> [ 3.296421] alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
> [ 3.660850] Error: Driver 'pcspkr' is already registered, aborting...
> [ 3.965481] Adding 262140k swap on /dev/xvda1. Priority:-1 extents:1 across:262140k SS
> [ 4.075322] EXT3-fs (xvda2): using internal journal
> [ 5.839480] sshd (534): /proc/534/oom_adj is deprecated, please use /proc/534/oom_score_adj instead.
> [ 15.408035] eth0: no IPv6 routers present
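Since the question is whether the BUG line shows up in a given capture, a quick filter over a saved log can answer it without reading the whole boot output. The sketch below is illustrative only: scan_guest_log is a hypothetical name, and the two patterns are simply the strings seen in this thread (the blkfront feature lines and the "scheduling while atomic" report).

```shell
#!/bin/sh
# Hypothetical helper: pull the lines of interest out of a dmesg or
# console capture supplied on stdin.
#  - "blkfront:" lines show which cache-flush feature the frontend negotiated
#  - "scheduling while atomic" is the BUG text reported in this thread
# grep exits non-zero when nothing matches, which signals "not in this log".
scan_guest_log() {
    grep -E 'blkfront:|scheduling while atomic'
}
```

For example: dmesg | scan_guest_log in the guest, or scan_guest_log < /tmp/log on a console capture taken with 'xl console <guest> | tee /tmp/log'.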
I'd have thought so as well. It's possible that was console output
from dom0, come to think of it.

On Tue, May 28, 2013 at 2:18 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, May 28, 2013 at 02:02:41PM -0400, Anthony Sheetz wrote:
>> On Tue, May 28, 2013 at 10:27 AM, Anthony Sheetz <sheetzam@inspire.com> wrote:
>> >> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
>> >> output from dom0? And the 'dmesg' output from the guest (or at least
>> >> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
>> >> the frontend/backend have the right negotiation parameters.
>> >
>> > Attached is the output of xenstore-ls from dom0, and dmesg from a domU
>> > with kernel 2.6.32-5-xen-amd64
>> > Will be working on putting a 3.2 kernel in place next, testing file
>> > transfer, and adding the output of dmesg from that.
>>
>> updated to 3.2 using
>> http://www.cyberciti.biz/faq/debian-linux-6-apt-get-install-linux-kernel-3-2/
>> for instructions.
>> During transfer of data saw this: BUG: scheduling while atomic:
>> kworker/0:2/10421/0x10000002
>
> ? I don't see it here?
>> Transfer test resulted in a file which did not match md5sum. Attached
>> is the dmesg output from the domU.
>
> Shouldn't the BUG be present here?
On Tue, 2013-05-28 at 14:15 -0400, Anthony Sheetz wrote:
> >> I would guess the difference is I am using LVM with full disk
> >> encryption. Take a look at
> >> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124 for the
> >> details on exactly how I am able to recreate this bug.
> >> In other words, I use the installer and chose the option to use full
> >> disk encryption and LVM.
> >> I'll be starting with the rest of the testing and data collection
> >> which was requested shortly.
> >
> > I would like to avoid reinstalling my whole OS, and I don't have a spare
> > HDD, so isn't there any way I can reproduce the full disk encryption
> > using a partition?
>
> As my colleague points out, the set up you have misses that a single
> encrypted object is in use by both dom0 and domU. Without having your
> dom0 on the same encrypted device as your domU (even though they use
> different logical volumes) I'm not sure how to test it.

Perhaps you could install a second dom0 rootfs on the LVM partition and
use that for testing. This would at least avoid blowing away the
original "primary" dom0 rootfs, which I suppose is what Roger would like
to avoid.

Ian.
Is there anything else I can get you at this time to help troubleshoot this?

On Fri, May 24, 2013 at 10:20 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Thu, May 23, 2013 at 02:19:50PM -0400, Anthony Sheetz wrote:
>> On Wed, May 22, 2013 at 4:10 PM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> > On Mon, Apr 22, 2013 at 01:26:34PM +0100, Ian Campbell wrote:
>> >> Konrad is on vacation this week, so it'll probably be next week before
>> >> this gets looked at by him.
>> >
>> > And I finally got to this email in my 'vacation-mbox'
>> >>
>> >> Ian.
>> >>
>> >> On Mon, 2013-04-22 at 13:22 +0100, Anthony Sheetz wrote:
>> >> > I realize folks are pretty busy, but we're still interested in getting
>> >> > this problem solved, and I want to be sure it's not lost in the
>> >> > shuffle.
>> >> > Any chance of getting some attention for it?
>> >> >
>> >> > On Wed, Apr 17, 2013 at 9:00 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> >> > > On Tue, 2013-04-16 at 18:39 +0100, Anthony Sheetz wrote:
>> >> > >> (re-sending, first message seems to have gotten lost)
>> >> > >>
>> >> > >> I was referred here by Ian Campbell ijc@hellion.org.uk from bugs.debian.org.
>> >> > >
>> >> > > I'm here too (different hat ;-)), thanks for posting it here. I've added
>> >> > > some people who know about the block stuff to the CC.
>> >> > >
>> >> > > Guys, my suspicion is that the issue is that barriers issued by ext3
>> >> > > inside the guest aren't making it all the way down the
>> >> > > ext3->blkfront->blkback->lvm->dm-crypt->disk chain leading the
>> >> > > filesystem to eventually corrupt itself.
>> >> > >
>> >> > > The issue seems to relate to the use of dm-crypt since
>> >> > > ext3->blkfront->blkback->lvm->disk is reported work fine.
>> >> > >
>> >> > > However there is no problem with the local dom0 ext3 root filesystem
>> >> > > which is also in the same lvm VG on the crypt device (i.e.
>> >> > > ext3->lvm->dm-crypt->disk), so its not purely a dm-crypt issue. I figure
>> >> > > something is up at the blkfront->back link which causes the barriers
>> >> > > which blkback is injecting into the block subsystem either don't make it
>> >> > > to the dm-crypt layer or do not DTRT once they arrive.
>> >> > >
>> >> > > I'm not really sure with how to proceed (or how to ask Anthony to
>> >> > > proceed) with verifying any part of that hypothesis though.
>> >> > >
>> >> > > ISTR issues with old vs new style barriers or barriers with no data in
>> >> > > them or something, could this be related to that? (or am I thinking of
>> >> > > DISCARD?)
>> >
>> > You are using two different kernel versions. The 2.6.32 domU is only using
>> > WRITE_BARRIERs, while in the 3.2 kernels they have been completely eliminated.
>> > The mechanism they use is called 'WRITE_FLUSH'. The 3.2 kernel has a patch:
>> > commit 29bde093787f3bdf7b9b4270ada6be7c8076e36b
>> > Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> > Date: Mon Oct 10 00:42:22 2011 -0400
>> >
>> > xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
>> >
>> >
>> > which emulates the barrier request by draining all of the outstanding I/Os and then
>> > sending the WRITE_FLUSH.
>> >
>> > But it looks like you are hitting an issue here. Just to make sure
>> > that is the case, what happens if you use the _same_ kernel in both dom0 and
>> > domU? Does it work then?
>> >
>>
>> First, thank you so much for getting back to me, it's really appreciated.
>> At this point I've forgotten if I did this with Wheezy on Wheezy, and
>> what the result was.
>> I'll have to test using the 3.2 kernel on the domU Debian Squeeze and
>> get back to you. I should be able to do that early next week.
>
> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
> output from dom0? And the 'dmesg' output from the guest (or at least
> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
> the frontend/backend have the right negotiation parameters.
>
> Have a good weekend!
On Tue, May 28, 2013 at 02:19:17PM -0400, Anthony Sheetz wrote:
> I'd have thought so as well. It's possible that was console output
> from dom0, come to think of it.

OK, any chance you could capture that? Some questions below:

>
> On Tue, May 28, 2013 at 2:18 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Tue, May 28, 2013 at 02:02:41PM -0400, Anthony Sheetz wrote:
> >> On Tue, May 28, 2013 at 10:27 AM, Anthony Sheetz <sheetzam@inspire.com> wrote:
> >> >> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
> >> >> output from dom0? And the 'dmesg' output from the guest (or at least
> >> >> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
> >> >> the frontend/backend have the right negotiation parameters.
> >> >
> >> > Attached is the output of xenstore-ls from dom0, and dmesg from a domU
> >> > with kernel 2.6.32-5-xen-amd64
> >> > Will be working on putting a 3.2 kernel in place next, testing file
> >> > transfer, and adding the output of dmesg from that.
> >>
> >> updated to 3.2 using
> >> http://www.cyberciti.biz/faq/debian-linux-6-apt-get-install-linux-kernel-3-2/
> >> for instructions.
> >> During transfer of data saw this: BUG: scheduling while atomic:
> >> kworker/0:2/10421/0x10000002
> >
> > ? I don't see it here?
> >> Transfer test resulted in a file which did not match md5sum. Attached
> >> is the dmesg output from the domU.

So the transfer you are speaking of is... what exactly? Are you using
'scp' to a disk in the guest? Can you describe to me how your disk in
the guest is set up? When you do the 'md5sum', do you do it after you
have dropped the cache? Is the storage on a USB stick/disk?
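The "dropped the cache" question matters because a checksum taken right after a transfer may be served from the guest's page cache rather than read back from the disk. A rough sketch of a checksum that forces a re-read follows. It is an assumption-laden illustration: md5_uncached and the DROP_CACHES override variable are invented for this example (the override exists only so the function can be tried without root), while /proc/sys/vm/drop_caches is the standard Linux knob and writing to it needs root.

```shell
#!/bin/sh
# Hypothetical helper: flush dirty data, drop the page cache, then
# checksum the file so the data is read back from the block device.
# DROP_CACHES is an invented override for trying this without root;
# by default the real kernel control file is used.
md5_uncached() {
    drop=${DROP_CACHES:-/proc/sys/vm/drop_caches}
    sync                          # write out dirty pages first
    echo 3 > "$drop" || return 1  # 3 = free page cache, dentries and inodes
    md5sum "$1"
}
```

Run as root inside the guest, for example: md5_uncached /path/to/transferred-file. Note that this only drops the guest's own cache.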
On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
> Is there anything else I can get you at this time to help troubleshoot this?

Well, this reminds me of an ext3 bug in the 2.6.32 stable tree for which
the ext3 maintainer was unwilling to backport the fix. It was a bug that
caused corruption.

If I could just remember the email thread about it.
On Thu, May 30, 2013 at 2:36 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
>> Is there anything else I can get you at this time to help troubleshoot this?
>
> Well, this reminds me of an ext3 bug in the 2.6.32 stable tree for which
> the ext3 maintainer was unwilling to backport the fix. It was a bug that
> caused corruption.
>
> If I could just remember the email thread about it.

Is there anything I can do at this point to help with this bug?
On Tue, Jun 04, 2013 at 08:55:26AM -0400, Anthony Sheetz wrote:
> On Thu, May 30, 2013 at 2:36 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
> >> Is there anything else I can get you at this time to help troubleshoot this?
> >
> > Well, this reminds me of an ext3 bug in the 2.6.32 stable tree for which
> > the ext3 maintainer was unwilling to backport the fix. It was a bug that
> > caused corruption.
> >
> > If I could just remember the email thread about it.

Can't recall it, but maybe Teck can?

> Is there anything I can do at this point to help with this bug?
On Tue, Jun 04, 2013 at 09:41:10AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Jun 04, 2013 at 08:55:26AM -0400, Anthony Sheetz wrote:
> > On Thu, May 30, 2013 at 2:36 PM, Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com> wrote:
> > > On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
> > >> Is there anything else I can get you at this time to help troubleshoot this?
> > >
> > > Well, this reminds me of an ext3 bug in the 2.6.32 stable tree that
> > > the maintainer of ext3 would not want to backport the fix. It was a
> > > bug that caused corruption.
> > >
> > > If I could just remember the email thread about it.
>
> Can't recall it, but maybe Teck can?

He doesn't seem to respond.

Anthony, I have this in my queue to look at, so I will get to it.
Sadly that is not going to happen this week :-(
Not a problem. Just wanted to be sure we weren't a dependency. Thanks
for your attention!

On Fri, Jun 7, 2013 at 1:10 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, Jun 04, 2013 at 09:41:10AM -0400, Konrad Rzeszutek Wilk wrote:
>> On Tue, Jun 04, 2013 at 08:55:26AM -0400, Anthony Sheetz wrote:
>> > On Thu, May 30, 2013 at 2:36 PM, Konrad Rzeszutek Wilk
>> > <konrad.wilk@oracle.com> wrote:
>> > > On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
>> > >> Is there anything else I can get you at this time to help troubleshoot this?
>> > >
>> > > Well, this reminds me of an ext3 bug in the 2.6.32 stable tree that
>> > > the maintainer of ext3 would not want to backport the fix. It was a
>> > > bug that caused corruption.
>> > >
>> > > If I could just remember the email thread about it.
>>
>> Can't recall it, but maybe Teck can?
>
> He doesn't seem to respond.
>
> Anthony, I have this on my queue to look - so will get to it.
> Sadly that is not going to happen this week :-(
On Fri, Jun 07, 2013 at 02:43:06PM -0400, Anthony Sheetz wrote:
> Not a problem. Just wanted to be sure we weren't a dependency. Thanks
> for your attention!
>
> On Fri, Jun 7, 2013 at 1:10 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Tue, Jun 04, 2013 at 09:41:10AM -0400, Konrad Rzeszutek Wilk wrote:
> >> On Tue, Jun 04, 2013 at 08:55:26AM -0400, Anthony Sheetz wrote:
> >> > On Thu, May 30, 2013 at 2:36 PM, Konrad Rzeszutek Wilk
> >> > <konrad.wilk@oracle.com> wrote:
> >> > > On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
> >> > >> Is there anything else I can get you at this time to help troubleshoot this?
> >> > >
> >> > > Well, this reminds me of an ext3 bug in the 2.6.32 stable tree that
> >> > > the maintainer of ext3 would not want to backport the fix. It was a
> >> > > bug that caused corruption.
> >> > >
> >> > > If I could just remember the email thread about it.
> >>
> >> Can't recall it, but maybe Teck can?
> >
> > He doesn't seem to respond.
> >
> > Anthony, I have this on my queue to look - so will get to it.
> > Sadly that is not going to happen this week :-(

I am installing a new box with Wheezy to try this out. The one thing I
could not find in the thread or in the bug report was the guest config.
Could you please reply with it?

Thanks.