Sergio Gelato wrote[1]:> That 3.4.1-1~experimental.1 build > (3.4-trunk-686-pae #1 SMP Wed Jun 6 15:11:31 UTC 2012 i686 GNU/Linux) > is even less well-behaved under Xen: I''m getting a kernel OOPS at > EIP: [<c1168e54>] atomic64_read_cx8+0x4/0xc SS:ESP e021:ca853c6c > The top of the trace message unfortunately scrolled off the console before I > could see it, and the message doesn''t have time to make it to syslog (either > local or remote).[...]> Non-Xen boots proceed normally.Yeah, apparently[2] that''s caused by commit 26c191788f18 Author: Andrea Arcangeli <aarcange@redhat.com> Date: Tue May 29 15:06:49 2012 -0700 mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition which was also included in Debian kernel 3.2.19-1. [1] http://bugs.debian.org/676360 [2] https://bugzilla.redhat.com/show_bug.cgi?id=829016#c4
On Thu, Jun 07, 2012 at 02:33:33AM -0500, Jonathan Nieder wrote:> Sergio Gelato wrote[1]: > > > That 3.4.1-1~experimental.1 build > > (3.4-trunk-686-pae #1 SMP Wed Jun 6 15:11:31 UTC 2012 i686 GNU/Linux) > > is even less well-behaved under Xen: I''m getting a kernel OOPS at > > EIP: [<c1168e54>] atomic64_read_cx8+0x4/0xc SS:ESP e021:ca853c6c > > The top of the trace message unfortunately scrolled off the console before I > > could see it, and the message doesn''t have time to make it to syslog (either > > local or remote). > [...] > > Non-Xen boots proceed normally. > > Yeah, apparently[2] that''s caused by > > commit 26c191788f18 > Author: Andrea Arcangeli <aarcange@redhat.com> > Date: Tue May 29 15:06:49 2012 -0700 > > mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition > > which was also included in Debian kernel 3.2.19-1. > > [1] http://bugs.debian.org/676360 > [2] https://bugzilla.redhat.com/show_bug.cgi?id=829016#c4Oops, sorry I didn''t imagine atomic64_read on a pmd would trip. Unfortunately to support pagetable walking with mmap_sem hold for reading, we need an atomic read on 32bit PAE if CONFIG_TRANSPARENT_HUGEPAGE=y. The only case requiring this is 32bit PAE with CONFIG_TRANSPARENT_HUGEPAGE=y at build time. If you set CONFIG_TRANSPARENT_HUGEPAGE=n temporarily you should be able to work around this as I optimized the code in a way to avoid an expensive cmpxchg8b. I guess if Xen can''t be updated to handle an atomic64_read on a pmd in the guest, we can add a pmd_read paravirt op? Or if we don''t want to break the paravirt interface a loop like gup_fast with irq disabled should also work but looping + local_irq_disable()/enable() sounded worse and more complex than a atomic64_read (gup fast already disables irqs because it doesn''t hold the mmap_sem so it''s a different cost looping there). AFIK Xen disables THP during boot, so a check on THP being enabled and falling back in the THP=n version of pmd_read_atomic, would also be safe, but it''s not so nice to do it with a runtime check. Thanks, Andrea
* Andrea Arcangeli [2012-06-07 12:33:55 +0200]:> I guess if Xen can''t be updated to handle an atomic64_read on a pmd in > the guest,I''m not sure if it makes a difference, but just in case: I observed the problem in a dom0.> we can add a pmd_read paravirt op? Or if we don''t want to > break the paravirt interface a loop like gup_fast with irq disabled > should also work but looping + local_irq_disable()/enable() sounded > worse and more complex than a atomic64_read (gup fast already disables > irqs because it doesn''t hold the mmap_sem so it''s a different cost > looping there). AFIK Xen disables THP during boot, so a check on THP > being enabled and falling back in the THP=n version of > pmd_read_atomic, would also be safe, but it''s not so nice to do it > with a runtime check. > > Thanks, > Andrea
Konrad Rzeszutek Wilk
2012-Jun-07 15:56 UTC
Bug#676360: [Xen-devel] xen: oops at atomic64_read_cx8+0x4
On Thu, Jun 07, 2012 at 12:33:55PM +0200, Andrea Arcangeli wrote:> On Thu, Jun 07, 2012 at 02:33:33AM -0500, Jonathan Nieder wrote: > > Sergio Gelato wrote[1]: > > > > > That 3.4.1-1~experimental.1 build > > > (3.4-trunk-686-pae #1 SMP Wed Jun 6 15:11:31 UTC 2012 i686 GNU/Linux) > > > is even less well-behaved under Xen: I''m getting a kernel OOPS at > > > EIP: [<c1168e54>] atomic64_read_cx8+0x4/0xc SS:ESP e021:ca853c6c > > > The top of the trace message unfortunately scrolled off the console before I > > > could see it, and the message doesn''t have time to make it to syslog (either > > > local or remote). > > [...] > > > Non-Xen boots proceed normally. > > > > Yeah, apparently[2] that''s caused by > > > > commit 26c191788f18 > > Author: Andrea Arcangeli <aarcange@redhat.com> > > Date: Tue May 29 15:06:49 2012 -0700 > > > > mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition > > > > which was also included in Debian kernel 3.2.19-1. > > > > [1] http://bugs.debian.org/676360 > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=829016#c4 > > Oops, sorry I didn''t imagine atomic64_read on a pmd would trip.Hmm, so it looks like it used to do this: pmd = pmd_offset(pud, addr); .. pmd_t pmdval = *pmd; but now you do: pmd_t ret = (pmd_val)((u32)*tmp); ret |= (*tmp+1) << 32; which would read the low first and then the high one next (or is the other way around?). The ''pmd_offset'' beforehand manufactures the pmd using the PFN to MFN lookup tree (so that there aren''t any hypercall or traps). Hm, with your change, you are still looking at the ''pmd'' and its contents, except that you are reading the low and then the high part. Why that would trip the hypervisor is not clear to me. Perhaps in the past it only read the low bits? If there was Xen hypervisor log that might give some ideas. Is there any chance that the Linode folks could send that over?> > Unfortunately to support pagetable walking with mmap_sem hold for > reading, we need an atomic read on 32bit PAE if > CONFIG_TRANSPARENT_HUGEPAGE=y. > > The only case requiring this is 32bit PAE with > CONFIG_TRANSPARENT_HUGEPAGE=y at build time. If you set > CONFIG_TRANSPARENT_HUGEPAGE=n temporarily you should be able to work > around this as I optimized the code in a way to avoid an expensive > cmpxchg8b.Ah, by just skipping the thing if the low bits are zero.> > I guess if Xen can''t be updated to handle an atomic64_read on a pmd in > the guest, we can add a pmd_read paravirt op? Or if we don''t want to > break the paravirt interface a loop like gup_fast with irq disabled > should also work but looping + local_irq_disable()/enable() sounded > worse and more complex than a atomic64_read (gup fast already disables > irqs because it doesn''t hold the mmap_sem so it''s a different costI am not really sure what is at foot. It sounds like the hypervisor didn''t like somebody reading the high and low bit, but isn''t the pmdval_t still 64-bit ? So I would have thought this would have been triggered? Or is that the code on pmd_val never actually read the high bits (before your addition to the atomic_read?)?> looping there). AFIK Xen disables THP during boot, so a check on THP > being enabled and falling back in the THP=n version of > pmd_read_atomic, would also be safe, but it''s not so nice to do it > with a runtime check.The thing is that I did install a 32-bit PAE guest (a Fedora) on a Fedora 17 dom0. So it looks like this is reading high part is fixed on the newer hypervisors, but now with the older ones. And the older one is Amazon EC2 so some .. hack to workaround older hypervisors could be added.
Hi, On Thu, Jun 07, 2012 at 11:56:47AM -0400, Konrad Rzeszutek Wilk wrote:> then the high part. Why that would trip the hypervisor > is not clear to me. Perhaps in the past it only read theThat is the CONFIG_TRANSPARENT_HUGEPAGE=n case and in fact it doesn''t trip the hypervisor. That was tested too, it should work fine. The problem is with the atomic64_read version, that one uses cmpxchg8b to read the contents of the pmdp.> Ah, by just skipping the thing if the low bits are zero.Yep.> didn''t like somebody reading the high and low bit, but isn''t the > pmdval_t still 64-bit ? So I would have thought this wouldThe pmd format is unchanged, that''s hardware.> The thing is that I did install a 32-bit PAE guest (a Fedora) on a Fedora > 17 dom0. So it looks like this is reading high part is fixed on the newer > hypervisors, but now with the older ones. And the older one is Amazon EC2 > so some .. hack to workaround older hypervisors could be added.The insn oopsing is cmpxchg8b and it''s not reading the low/high part in two separate insn but reading it in a single insn, which means the kernel oopsing was built with CONFIG_TRANSPARENT_HUGEPAGE=y.
>>> Andrea Arcangeli <aarcange@redhat.com> 06/07/12 12:35 PM >>> >Oops, sorry I didn''t imagine atomic64_read on a pmd would trip.The problem really is that the cmpxchg8b is a write, and hence won''t go through without faulting on a write protected page (which all page table pages necessarily are).>I guess if Xen can''t be updated to handle an atomic64_read on a pmd in >the guest, we can add a pmd_read paravirt op?Xen could certainly be updated to treat cmpxchg8b on a PMD entry as a simple 8-byte read when compared-to and to-be-stored values are identical, but the problem would be that hypervisors in the field wouldn''t necessarily get updated.>Or if we don''t want to >break the paravirt interface a loop like gup_fast with irq disabled >should also work but looping + local_irq_disable()/enable() sounded >worse and more complex than a atomic64_read (gup fast already disables >irqs because it doesn''t hold the mmap_sem so it''s a different cost >looping there). AFIK Xen disables THP during boot, so a check on THP >being enabled and falling back in the THP=n version of >pmd_read_atomic, would also be safe, but it''s not so nice to do it >with a runtime check.That would probably nevertheless be the most viable option. If performance critical(?), maybe this could be hidden with something like the alternative instruction or paravirt patching mechanisms? Jan