Jan Beulich
2012-Feb-17 15:50 UTC
regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
Tim, this c/s or-s PFEC_reserved_bit into the error code passed to the guest in two places, and I''m afraid it is responsible for problems we''re seeing when a Linux HVM PAE guest on a non-HAP platform runs into invocations of pgtable_bad(). Linux check the reserved bit before looking at the present bit, and expects the reserved bit to always be clear when the present bit also is. The respective lines from the guest''s kernel log however are Bad pagetable: 000c [#1] SMP Bad pagetable: 000e [#2] SMP Bad pagetable: 000e [#3] SMP Bad pagetable: 000c [#4] SMP Bad pagetable: 000c [#5] SMP Bad pagetable: 000c [#6] SMP Bad pagetable: 000a [#7] SMP Bad pagetable: 000a [#8] SMP Bad pagetable: 000c [#10] SMP Bad pagetable: 000c [#11] SMP (i.e. reserved flag always set, present flag always clear) with the corresponding page table dumps being *pdpt = 0000000035682001 *pde = 0000000013014067 *pte = 002dea0000000000 *pdpt = 0000000035d65001 *pde = 0000000028a8a067 *pte = 002dd16000000000 *pdpt = 0000000013177001 *pde = 0000000028bf9067 *pte = 002dd16000000000 *pdpt = 0000000035480001 *pde = 000000005e6b1067 *pdpt = 00000000133b6001 *pde = 000000005e6c1067 *pdpt = 00000000131b7001 *pde = 000000005de15067 *pdpt = 00000000354bd001 *pde = 00000000355ad067 *pte = 002dfca000000000 *pdpt = 000000002bd5c001 *pde = 000000002bf2c067 *pte = 002dfa6000000000 *pdpt = 000000002bd79001 *pde = 0000000000000000 *pdpt = 000000002bdc6001 *pde = 000000005e63c067 *pdpt = 00000000130ef001 *pde = 000000005e690067 (i.e. no invalid bits at all, but paged out entries with page indexes into the swap file indicating a swap size of over 10G - with anything up to 4G, the reserved bits would never be run into). The problem appears to be that the or-ing in of PFEC_reserved_bit happens without consideration of PFEC_present. If you can confirm I''m not mis-interpreting things, fixing this should be relatively strait forward (though the question would be whether it should be guest_walk_tables() or its callers that should be fixed). Jan
Tim Deegan
2012-Feb-17 16:23 UTC
Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
Hi, At 15:50 +0000 on 17 Feb (1329493838), Jan Beulich wrote:> this c/s or-s PFEC_reserved_bit into the error code passed to the guest > in two places, and I''m afraid it is responsible for problems we''re seeing > when a Linux HVM PAE guest on a non-HAP platform runs into > invocations of pgtable_bad(). Linux check the reserved bit before > looking at the present bit, and expects the reserved bit to always be > clear when the present bit also is. The respective lines from the > guest''s kernel log however are > > Bad pagetable: 000c [#1] SMP > Bad pagetable: 000e [#2] SMP > Bad pagetable: 000e [#3] SMP > Bad pagetable: 000c [#4] SMP > Bad pagetable: 000c [#5] SMP > Bad pagetable: 000c [#6] SMP > Bad pagetable: 000a [#7] SMP > Bad pagetable: 000a [#8] SMP > Bad pagetable: 000c [#10] SMP > Bad pagetable: 000c [#11] SMP > > (i.e. reserved flag always set, present flag always clear) with the > corresponding page table dumps being > > *pdpt = 0000000035682001 *pde = 0000000013014067 *pte = 002dea0000000000 > *pdpt = 0000000035d65001 *pde = 0000000028a8a067 *pte = 002dd16000000000 > *pdpt = 0000000013177001 *pde = 0000000028bf9067 *pte = 002dd16000000000 > *pdpt = 0000000035480001 *pde = 000000005e6b1067 > *pdpt = 00000000133b6001 *pde = 000000005e6c1067 > *pdpt = 00000000131b7001 *pde = 000000005de15067 > *pdpt = 00000000354bd001 *pde = 00000000355ad067 *pte = 002dfca000000000 > *pdpt = 000000002bd5c001 *pde = 000000002bf2c067 *pte = 002dfa6000000000 > *pdpt = 000000002bd79001 *pde = 0000000000000000 > *pdpt = 000000002bdc6001 *pde = 000000005e63c067 > *pdpt = 00000000130ef001 *pde = 000000005e690067 > > (i.e. no invalid bits at all, but paged out entries with page indexes > into the swap file indicating a swap size of over 10G - with anything > up to 4G, the reserved bits would never be run into).Right - so the problem is that the walker is choking on the reserved bits even though the present bit it clear in that particular PTE? Yeah, that''s a bug.> The problem appears to be that the or-ing in of PFEC_reserved_bit > happens without consideration of PFEC_present. If you can confirm > I''m not mis-interpreting things, fixing this should be relatively > strait forward (though the question would be whether it should be > guest_walk_tables() or its callers that should be fixed).We should fix it in guest_walk_tables, since AFAICS it''s possible for PFEC_reserved_bit to be set based on a bad higher-level entry even if a lower-level one has _PAGE_PRESENT clear. Something like the attached (compile-tested only) patch? (Sigh; this PT walker was once relatively readable and efficient.) Cheers, Tim. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2012-Feb-17 16:58 UTC
Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
>>> On 17.02.12 at 17:23, Tim Deegan <tim@xen.org> wrote: > At 15:50 +0000 on 17 Feb (1329493838), Jan Beulich wrote: >> The problem appears to be that the or-ing in of PFEC_reserved_bit >> happens without consideration of PFEC_present. If you can confirm >> I''m not mis-interpreting things, fixing this should be relatively >> strait forward (though the question would be whether it should be >> guest_walk_tables() or its callers that should be fixed). > > We should fix it in guest_walk_tables, since AFAICS it''s possible for > PFEC_reserved_bit to be set based on a bad higher-level entry even if a > lower-level one has _PAGE_PRESENT clear. > > Something like the attached (compile-tested only) patch?That was really fast, thanks! Yes, that looks like it should do it. We''ll want to give it a try early next week.> (Sigh; this PT walker was once relatively readable and efficient.)Sorry for that ;-) (I wonder though whether this whole evaluation of the flags couldn''t be put in a macro, as it''s - apart from the level number - the same in all four places afaics. Jan
Tim Deegan
2012-Feb-23 10:33 UTC
Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
At 16:58 +0000 on 17 Feb (1329497937), Jan Beulich wrote:> >>> On 17.02.12 at 17:23, Tim Deegan <tim@xen.org> wrote: > > At 15:50 +0000 on 17 Feb (1329493838), Jan Beulich wrote: > >> The problem appears to be that the or-ing in of PFEC_reserved_bit > >> happens without consideration of PFEC_present. If you can confirm > >> I''m not mis-interpreting things, fixing this should be relatively > >> strait forward (though the question would be whether it should be > >> guest_walk_tables() or its callers that should be fixed). > > > > We should fix it in guest_walk_tables, since AFAICS it''s possible for > > PFEC_reserved_bit to be set based on a bad higher-level entry even if a > > lower-level one has _PAGE_PRESENT clear. > > > > Something like the attached (compile-tested only) patch? > > That was really fast, thanks! > > Yes, that looks like it should do it. We''ll want to give it a try early > next week.I''ve applied it, since it seemed not to break anything and I''ll be away from my test boxes for a while. Let me know of you''re still seeing any problems. Cheers, Tim.
Jan Beulich
2012-Feb-23 14:34 UTC
Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
>>> Tim Deegan <tim@xen.org> 02/23/12 11:34 AM >>> >I''ve applied it, since it seemed not to break anything and I''ll be away >from my test boxes for a while. Let me know of you''re still seeing any >problems.Thanks - I had hoped we would have reported testing results earlier, but this unfortunately is going pretty slowly. One thing though - shouldn''t further walking be suppressed if at a given level any violation is detected (i.e. if rc became non-zero)? For sure the hardware aborts a walk at least if reserved bits are found set, and I''d suspect it also doesn''t bother continuing the walk if access rights don''t permit the result to be used. Jan
Tim Deegan
2012-Feb-23 14:59 UTC
Re: regression from 22242:7831b8e5aae2 (x86 guest pagetable walker: check for invalid bits in pagetable entries)?
At 14:34 +0000 on 23 Feb (1330007656), Jan Beulich wrote:> >>> Tim Deegan <tim@xen.org> 02/23/12 11:34 AM >>> > >I''ve applied it, since it seemed not to break anything and I''ll be away > >from my test boxes for a while. Let me know of you''re still seeing any > >problems. > > Thanks - I had hoped we would have reported testing results earlier, but > this unfortunately is going pretty slowly. > > One thing though - shouldn''t further walking be suppressed if at a > given level any violation is detected (i.e. if rc became non-zero)? For > sure the hardware aborts a walk at least if reserved bits are found > set, and I''d suspect it also doesn''t bother continuing the walk if access > rights don''t permit the result to be used.That sounds plausible, but I recall there being some subtleties about that when we first looked at it and I haven''t time to dig up what they were right now. I may be misremembering but I don''t want to tinker with it without being sure. I''ll look into it again when I get a minute. Tim.