On Sun, Sep 04, 2016 at 11:19:16AM +0300, Andriy Gapon wrote:
> On 01/09/2016 15:13, Slawa Olhovchenkov wrote:
> > DMAR: Found table at 0x79b32798
> > x2APIC available but disabled by DMAR table
>
> > Event timer "LAPIC" quality 600
> > LAPIC: ipi_wait() us multiplier 1 (r 116268019 tsc 2200043851)
> > ACPI APIC Table: <ALASKA A M I >
> > Package ID shift: 5
> > L3 cache ID shift: 5
> > L2 cache ID shift: 1
> > L1 cache ID shift: 1
> > Core ID shift: 1
> > kernel trap 12 with interrupts disabled
> >
> >
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 0; apic id = ff
>
> > fault virtual address = 0x0
> > fault code = supervisor read data, page not present
> > instruction pointer = 0x20:0xffffffff80537e74
> > stack pointer = 0x28:0xffffffff814b4a60
> > frame pointer = 0x28:0xffffffff814b4a70
> > code segment = base 0x0, limit 0xfffff, type 0x1b
> > = DPL 0, pres 1, long 1, def32 0, gran 1
> > processor eflags = resume, IOPL = 0
> > current process = 0 ()
> > trap number = 12
> > panic: page fault
> > cpuid = 0
> > KDB: stack backtrace:
> > #0 0xffffffff805272e7 at kdb_backtrace+0x67
> > #1 0xffffffff804dd662 at vpanic+0x182
> > #2 0xffffffff804dd4d3 at panic+0x43
> > #3 0xffffffff807a3791 at trap_fatal+0x351
> > #4 0xffffffff807a3983 at trap_pfault+0x1e3
> > #5 0xffffffff807a2f0c at trap+0x26c
> > #6 0xffffffff80787ca1 at calltrap+0x8
> > #7 0xffffffff8083b52a at topo_probe+0x61a
>
> Interesting. Could you please do 'list *topo_probe+0x61a' in kgdb, so that I
(kgdb) list *topo_probe+0x61a
0xffffffff8083b52a is in topo_probe (/usr/src/sys/x86/x86/mp_x86.c:540).
535 topo_layers[layer].subtype);
536 }
537 }
538
539 parent = &topo_root;
540 for (layer = 0; layer < nlayers; ++layer) {
541 node_id = boot_cpu_id >> topo_layers[layer].id_shift;
542 node = topo_find_node_by_hwid(parent, node_id,
543 topo_layers[layer].type,
544 topo_layers[layer].subtype);
Current language: auto; currently minimal
> can see what code is being executed when the trap happens? Also, disassembly of
> the function could be useful as well.
(kgdb) x/40i *topo_probe+0x600
0xffffffff8083b510 <topo_probe+1536>: and $0xf8,%al
0xffffffff8083b512 <topo_probe+1538>: movslq -0x4(%r12),%rcx
0xffffffff8083b517 <topo_probe+1543>: mov %rbx,%rdi
0xffffffff8083b51a <topo_probe+1546>: callq 0xffffffff80537e30 <topo_find_node_by_hwid>
0xffffffff8083b51f <topo_probe+1551>: mov %rax,%rbx
0xffffffff8083b522 <topo_probe+1554>: mov %rbx,%rdi
0xffffffff8083b525 <topo_probe+1557>: callq 0xffffffff80537e70 <topo_promote_child>
0xffffffff8083b52a <topo_probe+1562>: add $0xc,%r12
0xffffffff8083b52e <topo_probe+1566>: dec %r14d
0xffffffff8083b531 <topo_probe+1569>: jne 0xffffffff8083b500 <topo_probe+1520>
0xffffffff8083b533 <topo_probe+1571>: movb $0x1,0xffffffff80dfa664
0xffffffff8083b53b <topo_probe+1579>: add $0x68,%rsp
0xffffffff8083b53f <topo_probe+1583>: pop %rbx
0xffffffff8083b540 <topo_probe+1584>: pop %r12
0xffffffff8083b542 <topo_probe+1586>: pop %r13
0xffffffff8083b544 <topo_probe+1588>: pop %r14
0xffffffff8083b546 <topo_probe+1590>: pop %r15
0xffffffff8083b548 <topo_probe+1592>: pop %rbp
0xffffffff8083b549 <topo_probe+1593>: retq
0xffffffff8083b54a <topo_probe+1594>: nopw 0x0(%rax,%rax,1)
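A note on reading the addresses above. The trap's instruction pointer,
0xffffffff80537e74, lies just past the start of topo_promote_child
(0xffffffff80537e70 in the disassembly), and frame #7, topo_probe+0x61a, is
the return address right after the callq to topo_promote_child. Together with
"fault virtual address = 0x0", this is consistent with topo_find_node_by_hwid
having returned NULL and topo_promote_child dereferencing it, although the
output alone does not prove that. The arithmetic can be sketched in userland
C; offset_from is a made-up helper for illustration, not a kernel function,
and the constants are copied from the output above:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): byte offset of an
 * address from the start of the symbol or instruction it is compared to.
 */
static uint64_t
offset_from(uint64_t addr, uint64_t start)
{
	assert(addr >= start);	/* addr must not precede start */
	return (addr - start);
}

/* Values copied from the trap message and the kgdb session above. */
static const uint64_t fault_ip      = 0xffffffff80537e74ULL; /* "instruction pointer" */
static const uint64_t promote_child = 0xffffffff80537e70ULL; /* <topo_promote_child> */
static const uint64_t frame7_addr   = 0xffffffff8083b52aULL; /* backtrace frame #7 */
static const uint64_t call_insn     = 0xffffffff8083b525ULL; /* callq topo_promote_child */
```

offset_from(fault_ip, promote_child) gives 4, so the fault is at
topo_promote_child+0x4, and offset_from(frame7_addr, call_insn) gives 5, the
length of the callq instruction.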
> Wait...
> Kostik, I see one strange thing which is common to both successful and
> unsuccessful configurations. All "SMP: Added CPU..." lines have "AP" in them.
for #1..#23
no line 'SMP: AP CPU #0 Launched!'
> It seems like the platform does not explicitly tell which CPU is the BSP,
> see cpu_add() function. This can break quite a few assumptions. And I am not
> even sure how the successful scenario works.
# mptable
==============================================================================
MPTable
-------------------------------------------------------------------------------
MP Floating Pointer Structure:
location: BIOS
physical address: 0x000fd050
signature: '_MP_'
length: 16 bytes
version: 1.4
checksum: 0x27
mode: Virtual Wire
-------------------------------------------------------------------------------
MP Config Table Header:
physical address: 0x000fcaa0
signature: 'PCMP'
base table length: 1228
version: 1.4
checksum: 0x95
OEM ID: 'A M I'
Product ID: 'ALASKA'
OEM table pointer: 0x00000000
OEM table size: 0
entry count: 112
local APIC address: 0xfee00000
extended table length: 220
extended table checksum: 72
-------------------------------------------------------------------------------
MP Config Base Table Entries:
--
Processors: APIC ID  Version  State        Family  Model  Step  Flags
            0        0x15     BSP, usable  6       15     1     0xbfebfbff
            2        0x15     AP, usable   6       15     1     0xbfebfbff
            4        0x15     AP, usable   6       15     1     0xbfebfbff
            6        0x15     AP, usable   6       15     1     0xbfebfbff
            8        0x15     AP, usable   6       15     1     0xbfebfbff
            10       0x15     AP, usable   6       15     1     0xbfebfbff
            16       0x15     AP, usable   6       15     1     0xbfebfbff
            18       0x15     AP, usable   6       15     1     0xbfebfbff
            20       0x15     AP, usable   6       15     1     0xbfebfbff
            22       0x15     AP, usable   6       15     1     0xbfebfbff
            24       0x15     AP, usable   6       15     1     0xbfebfbff
            26       0x15     AP, usable   6       15     1     0xbfebfbff
            32       0x15     AP, usable   6       15     1     0xbfebfbff
            34       0x15     AP, usable   6       15     1     0xbfebfbff
            36       0x15     AP, usable   6       15     1     0xbfebfbff
            38       0x15     AP, usable   6       15     1     0xbfebfbff
            40       0x15     AP, usable   6       15     1     0xbfebfbff
            42       0x15     AP, usable   6       15     1     0xbfebfbff
            48       0x15     AP, usable   6       15     1     0xbfebfbff
            50       0x15     AP, usable   6       15     1     0xbfebfbff
            52       0x15     AP, usable   6       15     1     0xbfebfbff
            54       0x15     AP, usable   6       15     1     0xbfebfbff
            56       0x15     AP, usable   6       15     1     0xbfebfbff
            58       0x15     AP, usable   6       15     1     0xbfebfbff
> Ah... I see that there is backup code in cpu_mp_start() where boot_cpu_id is
> set based on the current CPU's Local APIC ID. I suspect then that this
> information is incorrect in the failing case.
>
> Slawa,
> my guess can be checked by adding a printf to cpu_mp_start() right after
> boot_cpu_id assignment.
The system is now in early production and I can't reboot it often.
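
Short of a reboot, the guess can at least be illustrated on paper. The sketch
below is a userland model in C, not the kernel code: it applies the
"Package ID shift: 5" value from the boot log to the APIC IDs listed by
mptable and checks whether a given boot_cpu_id would fall into a package that
actually exists. The names package_known, apic_ids and PKG_ID_SHIFT are made
up for illustration. The value 0xff used in the note after the code comes from
the "apic id = ff" line of the trap message and is only an assumed example of
a bogus ID; whether boot_cpu_id really held that value is not known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* "Package ID shift: 5" from the boot log above. */
#define PKG_ID_SHIFT	5
static_assert(PKG_ID_SHIFT < 32, "shift must be smaller than the ID width");

/* APIC IDs of the 24 processors reported by mptable above. */
static const unsigned apic_ids[] = {
	0, 2, 4, 6, 8, 10, 16, 18, 20, 22, 24, 26,
	32, 34, 36, 38, 40, 42, 48, 50, 52, 54, 56, 58
};

/*
 * Illustrative model only: return true if boot_cpu_id maps to the
 * same package ID as at least one APIC ID from the table.
 */
static bool
package_known(unsigned boot_cpu_id)
{
	unsigned want = boot_cpu_id >> PKG_ID_SHIFT;
	size_t i;

	for (i = 0; i < sizeof(apic_ids) / sizeof(apic_ids[0]); i++) {
		if ((apic_ids[i] >> PKG_ID_SHIFT) == want)
			return (true);
	}
	return (false);
}
```

With these inputs the listed APIC IDs map to packages 0 and 1, so
package_known(0) and package_known(58) are true, while package_known(0xff)
is false because 0xff >> 5 is 7. In this model, a boot_cpu_id outside the
enumerated set finds no matching package, which is the kind of failed lookup
the guess above would predict.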
> > #8 0xffffffff8078fe81 at cpu_mp_start+0x1b1
> > #9 0xffffffff805382ca at mp_start+0x3a
> > #10 0xffffffff80465cd8 at mi_startup+0x118
> > #11 0xffffffff8028dfac at btext+0x2c
> > Uptime: 1s
>
>
> --
> Andriy Gapon