Hi all,

This was also discussed earlier, for example here:
http://xen.markmail.org/thread/iqvkylp3mclmsnbw

Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a global variable which, among other things, is used to prevent disabling the cpu scheduler, breaking vcpu affinities, and removing the cpu from its cpupool on suspend. However, it missed one place where the cpu is removed from the cpupool's valid cpus mask, in smpboot.c, __cpu_disable(), line 840:

    cpumask_clear_cpu(cpu, cpupool0->cpu_valid);

This causes the vcpu in the default pool to be considered inactive, and the following assertion is violated in sched_credit.c soon after resume transitions out of xen, causing a platform reboot:

(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs ...
(XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' failed at sched_credit.c:507
(XEN) ----[ Xen-4.3-unstable x86_64 debug=y Tainted: C ]----
(XEN) CPU: 1
(XEN) RIP: e008:[<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
(XEN) RFLAGS: 0000000000010202 CONTEXT: hypervisor
(XEN) rax: 0000000000000001 rbx: 0000000000000008 rcx: 0000000000000008
(XEN) rdx: 00000000000000ff rsi: 0000000000000008 rdi: 0000000000000000
(XEN) rbp: ffff83011415fdd8 rsp: ffff83011415fcf8 r8: 0000000000000000
(XEN) r9: 000000000000003e r10: 00000008f3de731f r11: ffffea0000063800
(XEN) r12: ffff82c480261720 r13: ffff830137b4d950 r14: ffff830137beb010
(XEN) r15: ffff82c480261720 cr0: 0000000080050033 cr4: 00000000000026f0
(XEN) cr3: 000000013c17d000 cr2: ffff8800ac6ef8f0
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) Xen stack trace from rsp=ffff83011415fcf8:
(XEN)  00000000000af257 0000000800000001 ffff8300ba4fd000 0000000000000000
(XEN)  0000000000000000 0000000000000000 0000000000000002 ffff8800ac6ef8f0
(XEN)  0000000800000000 00000001318e0025 0000000000000087 ffff83011415fd68
(XEN)  ffff82c480124f79 ffff83011415fd98 ffff83011415fda8 00007fda88d1e790
(XEN)  ffff8800ac6ef8f0 00000001318e0025 0000000000000000 0000000000000000
(XEN)  0000000000000000 0000000000000000 0000000000000146 ffff830137b4d940
(XEN)  0000000000000001 ffff830137b4d950 ffff830137beb010 ffff82c480261720
(XEN)  ffff83011415fe48 ffff82c48011a51b 0002000e00000007 ffffffff81009071
(XEN)  000000000000e033 ffff83013a805360 ffff880002bb3c28 000000000000e02b
(XEN)  e4d87248e7ca5f52 ffff830102ae2200 0000000000000001 ffff82c48011a356
(XEN)  00000008efa1f543 00007fda88d1e790 ffff83011415fe78 ffff82c48012748f
(XEN)  0000000000000002 ffff830137beb028 ffff830102ae2200 ffff830137beb8d0
(XEN)  ffff83011415fec8 ffff82c48012758b ffff830114150000 ffff8800ac6ef8f0
(XEN)  80100000ae86d065 ffff82c4802e0080 ffff82c4802e0000 ffff830114158000
(XEN)  ffffffffffffffff 00007fda88d1e790 ffff83011415fef8 ffff82c480124b4e
(XEN)  ffff8300ba4fd000 ffffea0000063800 00000001318e0025 ffff8800ac6ef8f0
(XEN)  ffff83011415ff08 ffff82c480124bb4 00007cfeebea00c7 ffff82c480226a71
(XEN)  00007fda88d1e790 ffff8800ac6ef8f0 00000001318e0025 ffffea0000063800
(XEN)  ffff880002bb3c78 00000001318e0025 ffffea0000063800 0000000000000146
(XEN)  00003ffffffff000 ffffea0002b1bbf0 0000000000000000 00000001318e0025
(XEN) Xen call trace:
(XEN)    [<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
(XEN)    [<ffff82c48011a51b>] csched_tick+0x1c5/0x342
(XEN)    [<ffff82c48012748f>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c48012758b>] timer_softirq_action+0xde/0x206
(XEN)    [<ffff82c480124b4e>] __do_softirq+0x8e/0x99
(XEN)    [<ffff82c480124bb4>] do_softirq+0x13/0x15
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' failed at sched_credit.c:507
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

^ The reason for the above is that the "cpus" cpumask is empty, as it is a logical "and" between the cpupool's valid cpus (from which the cpu was removed) and the cpu affinity mask.
The attached patch follows the spirit of changeset 25079:d5ccb2d1dbd1 (which blocked removal of the cpu from the cpupool in cpupool.c) by also blocking its removal from the cpupool's valid cpumask. So cpu affinities are still preserved across suspend/resume, and the scheduler does not need to be disabled, as per the original intent (I think). Would welcome comments.

Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

Commit message:
Fix s3 resume regression (crash in scheduler) after c-s 25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the cpupool's cpu_valid mask - in the spirit of the mentioned c-s.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
>>> On 23.01.13 at 16:51, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> Hi all,
>
> This was also discussed earlier, for example here
> http://xen.markmail.org/thread/iqvkylp3mclmsnbw
>
> Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a
> global variable, which, among other things, is used to prevent disabling
> cpu scheduler, prevent breaking vcpu affinities, prevent removing the
> cpu from cpupool on suspend. However, it missed one place where cpu is
> removed from the cpupool valid cpus mask, in smpboot.c, __cpu_disable(),
> line 840:
>
> cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
>
> This causes the vcpu in the default pool to be considered inactive, and
> the following assertion is violated in sched_credit.c soon after resume
> transitions out of xen, causing a platform reboot:
>
> [crash log quoted in full in the original message; snipped here]
> ^ reason for above being that "cpus" cpumask is empty as it is a logical
> "and" between cpupool's valid cpus (from which the cpu was removed) and
> cpu affinity mask.

So can you confirm that this is not a problem on a non-debug hypervisor? I'm particularly asking because, leaving the ASSERT() aside, such a fundamental flaw would have made it impossible for S3 to work for anyone, and that's reportedly not the case.

> Attached patch follows the spirit of the changeset 25079:d5ccb2d1dbd1
> (which blocked removal of the cpu from the cpupool in cpupool.c) by also
> blocking its removal from the cpupool's valid cpumask. So cpu
> affinities are still preserved across suspend/resume, and the scheduler
> does not need to be disabled, as per original intent (I think). Would
> welcome comments.

Looks reasonable (and consistent with the earlier change), but I'd still like to wait for at least Keir's and Juergen's opinion.

Jan

> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
>
> Commit message:
> Fix s3 resume regression (crash in scheduler) after c-s
> 25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the
> cpupool's cpu_valid mask - in the spirit of mentioned c-s.
On 23/01/13 17:11, Jan Beulich wrote:
> So can you confirm that this is not a problem on a non-debug
> hypervisor? I'm particularly asking because, leaving the ASSERT()
> aside, such a fundamental flaw would have made it impossible
> for S3 to work for anyone, and that's reportedly not the case.

Jan, I tested with a non-debug hypervisor, without the patch, and it seems I get one of two outcomes: either it works, but the cpupool is emptied of all cpus but cpu 0 (so I presume nothing will get scheduled on them after resume), as evidenced by the "dump run queues" debug key, or I get the following crash:

(XEN) ----[ Xen-4.3-unstable x86_64 debug=n Tainted: C ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff82c48012113a>] vcpu_wake+0x4a/0x3b0
(XEN) RFLAGS: 0000000000010092 CONTEXT: hypervisor
(XEN) rax: 00007d3b7fd10180 rbx: ffff8300ba6f3000 rcx: 0000000000000000
(XEN) rdx: 0000000000000000 rsi: 0000000000000040 rdi: ffff8300ba6f3000
(XEN) rbp: ffff82c4802da1c0 rsp: ffff82c4802af998 r8: 0000000000000016
(XEN) r9: ffff8301300bc0c8 r10: 0000000000000002 r11: ffff83013797d010
(XEN) r12: ffff82c4802efee0 r13: 0000000000000092 r14: ffff82c4802efee0
(XEN) r15: ffff82c4802da1c0 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 00000000ba65d000 cr2: 0000000000000060
(XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff82c4802af998:
(XEN)  ffff82c4802afbc0 ffffffffffffff49 0000000000000000 ffff82c400000010
(XEN)  0000000800000008 ffff8300ba6f3000 0000000000000000 000000000000001b
(XEN)  ffff830137984000 ffff83013c5e00f0 0000000000000001 ffff82c48015b6fd
(XEN)  000000000013c4be ffff830137984000 0000000000000027 ffff82c480106375
(XEN)  ffff82c4802a8000 7400000000000000 ffff8301379c1b80 ffff82c4802c7800
(XEN)  ffff82c4802f02e8 ffff82c480165a2b ffff8800ba60e3c8 ffff8301300bc800
(XEN)  000000000000001b ffff82c4802afbf8 ffff8301379c1ba4 0000000000000000
(XEN)  ffffffffffff8000 2000000000000001 ffff82c4802afab8 ffff82c4802c7800
(XEN)  ffff82c4802f02e8 0000000000000000 ffff8301379c0080 ffff82c4802afbf8
(XEN)  00000000000000f0 ffff82c48015d507 00000000000000f0 ffff82c4802afbf8
(XEN)  ffff8301379c0080 0000000000000000 ffff82c4802f02e8 ffff82c4802c7800
(XEN)  0000000000000400 0000000000000002 0000000000000000 0000000000000002
(XEN)  ffff82c4802efea0 ffff82c48024dfc0 ffff82c4802a8000 ffff8301379c00a4
(XEN)  ffff8301379c00a4 0000006900000000 ffff82c480165cf7 000000000000e008
(XEN)  0000000000000202 ffff82c4802afb78 000000000000e010 ffff82c480165cf7
(XEN)  0000000000000096 0000000000000000 ffff82c48024dfc0 0000000000000000
(XEN)  ffff8301379c00a4 0000000000000092 0000000000000096 ffff82c48013bb56
(XEN)  ffff82c480263700 0000000000000000 ffff82c4802efea0 0000000000000003
(XEN)  ffff82c4802f0290 ffff82c4802a8000 00000000000026f0 ffff82c48015d507
(XEN)  00000000000026f0 ffff82c4802a8000 ffff82c4802f0290 0000000000000003
(XEN) Xen call trace:
(XEN)    [<ffff82c48012113a>] vcpu_wake+0x4a/0x3b0
(XEN)    [<ffff82c48015b6fd>] vcpu_kick+0x1d/0x80
(XEN)    [<ffff82c480106375>] evtchn_set_pending+0x105/0x180
(XEN)    [<ffff82c480165a2b>] do_IRQ+0x1db/0x5e0
(XEN)    [<ffff82c48015d507>] common_interrupt+0x57/0x60
(XEN)    [<ffff82c480165cf7>] do_IRQ+0x4a7/0x5e0
(XEN)    [<ffff82c480165cf7>] do_IRQ+0x4a7/0x5e0
(XEN)    [<ffff82c48013bb56>] __serial_putc+0x76/0x160
(XEN)    [<ffff82c48015d507>] common_interrupt+0x57/0x60
(XEN)    [<ffff82c48019a0ef>] enter_state+0x29f/0x390
(XEN)    [<ffff82c480139ce0>] serial_rx+0/0xa0
(XEN)    [<ffff82c480100000>] _stext+0/0x14
(XEN)    [<ffff82c480110b57>] handle_keypress+0x67/0xd0
(XEN)    [<ffff82c48013bca6>] serial_rx_interrupt+0x66/0xe0
(XEN)    [<ffff82c48013a1db>] __ns16550_poll+0x4b/0xb0
(XEN)    [<ffff82c48013a190>] __ns16550_poll+0/0xb0
(XEN)    [<ffff82c480139fd6>] ns16550_poll+0x26/0x30
(XEN)    [<ffff82c480182499>] do_invalid_op+0x3d9/0x3f0
(XEN)    [<ffff82c480139fb0>] ns16550_poll+0/0x30
(XEN)    [<ffff82c480216355>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82c480139fb0>] ns16550_poll+0/0x30
(XEN)    [<ffff82c480139fcc>] ns16550_poll+0x1c/0x30
(XEN)    [<ffff82c480125a3b>] timer_softirq_action+0xbb/0x2b0
(XEN)    [<ffff82c48012311a>] __do_softirq+0x5a/0x90
(XEN)    [<ffff82c480156e95>] idle_loop+0x25/0x50
(XEN)
(XEN) Pagetable walk from 0000000000000060:
(XEN)  L4[0x000] = 000000013a80d063 ffffffffffffffff
(XEN)  L3[0x000] = 000000013a80c063 ffffffffffffffff
(XEN)  L2[0x000] = 000000013a80b063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000060
(XEN) ****************************************

With the patch, there's no crash and the cpupool correctly contains 4 cpus after resume (same as before suspend).

Sidenote: I was also getting the vcpu_wake crashes earlier, on debug versions, when I just commented out the asserts for test purposes, so I had assumed it's a consequence of a violated assert later on.
Aha, also don't be offended by the attached stack traces having serial port activity in them - I had to get them by hooking most of the xen suspend logic (without the actual do_lowlevel_suspend) onto a debug key instead of using real s3, since I can't otherwise get a resume serial trace on this laptop, which uses a pci serial card. But I did test the resulting patch against a real s3 cycle.
Okay, sadly after more testing it seems I am getting the same vcpu_wake crash even with the patch (although it's somewhat rare). From the disassembly it actually looks to be crashing somewhere in vcpu_schedule_lock(), which gets inlined. So whilst the patch fixes the fact that cpus are removed from cpupool0 after resume, the crashiness remains, which might be a separate bug - so I'm going to attempt to dig here more.
Am 23.01.2013 16:51, schrieb Tomasz Wroblewski:
> Hi all,
>
> This was also discussed earlier, for example here
> http://xen.markmail.org/thread/iqvkylp3mclmsnbw
>
> Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a
> global variable, which, among other things, is used to prevent disabling
> cpu scheduler, prevent breaking vcpu affinities, prevent removing the
> cpu from cpupool on suspend. However, it missed one place where cpu is
> removed from the cpupool valid cpus mask, in smpboot.c, __cpu_disable(),
> line 840:
>
> cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
>
> [crash log quoted in full in the original message; snipped here]
>
> ^ reason for above being that "cpus" cpumask is empty as it is a logical
> "and" between cpupool's valid cpus (from which the cpu was removed) and
> cpu affinity mask.
>
> Attached patch follows the spirit of the changeset 25079:d5ccb2d1dbd1
> (which blocked removal of the cpu from the cpupool in cpupool.c) by also
> blocking its removal from the cpupool's valid cpumask. So cpu
> affinities are still preserved across suspend/resume, and the scheduler
> does not need to be disabled, as per original intent (I think). Would
> welcome comments.
>
> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

> Commit message:
> Fix s3 resume regression (crash in scheduler) after c-s
> 25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the
> cpupool's cpu_valid mask - in the spirit of mentioned c-s.

--
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions           e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                          Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
Hi,

I had to make another version of this patch, which also fixes the additional intermittent crash in vcpu_wake(). Would be grateful for comments / possible ack again.

This crash was happening when vcpu_wake is called between disable_nonboot_cpus() and enable_nonboot_cpus() on the s3 path (it happens, for example, from the insides of rcu_barrier() sometimes). It turns out that it was due to vcpu_schedule_lock() accessing the per_cpu area, which however is freed at this point, as it's freed in the percpu.c cpu-down callback. I tried the approach of preserving the per-cpu area on cpu down/up (during s3), as well as testing for the cpu being online in vcpu_wake before acquiring this lock, but ultimately, although helping a bit, these were not fully successful. So I concluded it's probably not really correct to let the scheduler run rampant during the disable_nonboot_cpus() / enable_nonboot_cpus() window on the s3 path, and made a new patch version. Tested it across many s3 iterations (on a Lenovo T520), with no problems. It should be pretty unintrusive as it only touches the S3 path.

Changes from v1:
- modified cpu_disable_scheduler (schedule.c) to run most of the stop-scheduler logic again (i.e. the vcpu migrate). This is a partial revert of c-s 25079:d5ccb2d1dbd1. However, breaking of domain vcpu affinities seems to be avoidable on this path, so added a condition to do just that
- instead of skipping the cpupool0->cpu_valid cpumask clear on the suspend path, added restore of that bit on the resume path. Moving this was needed because otherwise cpu_disable_scheduler() fails to migrate the vcpu (as it's still in the cpu_valid mask when the migration is attempted), returns EAGAIN, and BUG() will fire in __cpu_disable()

Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

Commit message:
Fix S3 resume regression after C-S 25079:d5ccb2d1dbd1. The regression causes either an intermittent crash in vcpu_wake (attempt to access vcpu_schedule_lock, which lives in the freed per-cpu area at this point), or, in debug xen, a more frequent ASSERT(!cpumask_empty(&cpus)...) firing in _csched_cpu_pick. Fix this by reverting the hunk which turned off disabling the cpu scheduler on the suspend path. Additionally, avoid breaking domain vcpu affinities on the suspend path. On resume, restore the frozen cpus in the cpupool's cpu_valid mask, so they can once again be used by the scheduler.
>>> On 24.01.13 at 15:26, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> @@ -212,6 +213,8 @@
>          BUG_ON(error == -EBUSY);
>          printk("Error taking CPU%d up: %d\n", cpu, error);
>      }
> +    if (system_state == SYS_STATE_resume)
> +        cpumask_set_cpu(cpu, cpupool0->cpu_valid);

This can't be right: What tells you that all CPUs were in pool 0?

Also, for the future - generating patches with -p helps quite a bit in reviewing them.

> --- a/xen/common/schedule.c	Mon Jan 21 17:03:10 2013 +0000
> +++ b/xen/common/schedule.c	Thu Jan 24 13:40:31 2013 +0000
> @@ -545,7 +545,7 @@
>      int ret = 0;
>
>      c = per_cpu(cpupool, cpu);
> -    if ( (c == NULL) || (system_state == SYS_STATE_suspend) )
> +    if ( c == NULL )
>          return ret;
>
>      for_each_domain_in_cpupool ( d, c )
> @@ -556,7 +556,8 @@
>
>          cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid);
>          if ( cpumask_empty(&online_affinity) &&
> -             cpumask_test_cpu(cpu, v->cpu_affinity) )
> +             cpumask_test_cpu(cpu, v->cpu_affinity) &&
> +             system_state != SYS_STATE_suspend )
>          {
>              printk("Breaking vcpu affinity for domain %d vcpu %d\n",
>                     v->domain->domain_id, v->vcpu_id);

I doubt this is correct, as you don't restore any of the settings during resume that you tear down here.

Jan
On 24/01/13 15:36, Jan Beulich wrote:
>>>> On 24.01.13 at 15:26, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> @@ -212,6 +213,8 @@
>>          BUG_ON(error == -EBUSY);
>>          printk("Error taking CPU%d up: %d\n", cpu, error);
>>      }
>> +    if (system_state == SYS_STATE_resume)
>> +        cpumask_set_cpu(cpu, cpupool0->cpu_valid);
> This can't be right: What tells you that all CPUs were in pool 0?
>
> Also, for the future - generating patches with -p helps quite
> a bit in reviewing them.

If you're using mercurial e-mail / mercurial queues, you can get this effect by adding the following lines to your ~/.hgrc:

[diff]
showfunc = True
nodates = 1
git = 1

 -George
Tomasz Wroblewski
2013-Jan-24 16:25 UTC
Re: [PATCH v2] Fix scheduler crash after s3 resume
On 24/01/13 16:36, Jan Beulich wrote:
>>>> On 24.01.13 at 15:26, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> @@ -212,6 +213,8 @@
>>          BUG_ON(error == -EBUSY);
>>          printk("Error taking CPU%d up: %d\n", cpu, error);
>>      }
>> +    if (system_state == SYS_STATE_resume)
>> +        cpumask_set_cpu(cpu, cpupool0->cpu_valid);
> This can't be right: What tells you that all CPUs were in pool 0?

You're right, in my simple tests this was the case, but generally speaking it might not be. Would an approach based on storing the cpupool0 mask in disable_nonboot_cpus() and restoring it in enable_nonboot_cpus() be more acceptable?

> Also, for the future - generating patches with -p helps quite
> a bit in reviewing them.

Ok, thanks!

>> --- a/xen/common/schedule.c	Mon Jan 21 17:03:10 2013 +0000
>> +++ b/xen/common/schedule.c	Thu Jan 24 13:40:31 2013 +0000
>> @@ -545,7 +545,7 @@
>>      int ret = 0;
>>
>>      c = per_cpu(cpupool, cpu);
>> -    if ( (c == NULL) || (system_state == SYS_STATE_suspend) )
>> +    if ( c == NULL )
>>          return ret;
>>
>>      for_each_domain_in_cpupool ( d, c )
>> @@ -556,7 +556,8 @@
>>
>>          cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid);
>>          if ( cpumask_empty(&online_affinity) &&
>> -             cpumask_test_cpu(cpu, v->cpu_affinity) )
>> +             cpumask_test_cpu(cpu, v->cpu_affinity) &&
>> +             system_state != SYS_STATE_suspend )
>>          {
>>              printk("Breaking vcpu affinity for domain %d vcpu %d\n",
>>                     v->domain->domain_id, v->vcpu_id);
> I doubt this is correct, as you don't restore any of the settings
> during resume that you tear down here.

Is the objection about the affinity part or also the (c == NULL) bit? The cpu_disable_scheduler() function is currently part of the regular cpu-down process, and was also part of the suspend process before the "system state variable" changeset which regressed it. So the (c == NULL) hunk mostly just returns to the previous state, where this was working a lot better (by empirical testing). But I am no expert on this, so would be grateful for ideas on how this could be fixed in a better way!

Just to recap, the current problem boils down, I believe, to the fact that the vcpu_wake (schedule.c) function keeps getting called occasionally during the S3 path for cpus which have their per_cpu data freed, causing a crash. The safest way of fixing it seemed to be to just put the suspend cpu_disable_scheduler under the regular path again - it probably isn't the best...
>>> On 24.01.13 at 17:25, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> On 24/01/13 16:36, Jan Beulich wrote:
>> This can't be right: What tells you that all CPUs were in pool 0?
> You're right, in my simple tests this was the case, but generally
> speaking it might not be. Would an approach based on storing the cpupool0
> mask in disable_nonboot_cpus() and restoring it in enable_nonboot_cpus()
> be more acceptable?

As long as that doesn't lead to other pools still coming back corrupted after resume. You may need to work this out with Juergen.

> Is the objection about the affinity part or also the (c == NULL) bit?
> The cpu_disable_scheduler() function is currently part of the regular cpu-down
> process, and was also part of the suspend process before the "system
> state variable" changeset which regressed it. So the (c == NULL) hunk
> mostly just returns to the previous state, where this was working a lot
> better (by empirical testing). But I am no expert on this, so would be
> grateful for ideas on how this could be fixed in a better way!

The change you're about to partly revert was correcting one fundamental mistake: bringing down a CPU at runtime is a different thing than doing the same during suspend. In the former case you indeed want all associations to it to be cut off, whereas in the S3 case you want everything to come back at resume the way it was before suspend. So I think you're just trying to revert too much of that original change. But again, Keir and Juergen (who collectively did that change iirc) would be good for you to consult with.

> Just to recap, the current problem boils down, I believe, to the fact
> that the vcpu_wake (schedule.c) function keeps getting called occasionally
> during the S3 path for cpus which have their per_cpu data freed, causing a
> crash. The safest way of fixing it seemed to be to just put the suspend
> cpu_disable_scheduler under the regular path again - it probably isn't the
> best...

It certainly looks wrong for vcpu_wake() to be called on an offline CPU in the first place (i.e. I'm getting the impression you're trying to cure symptoms rather than the root cause). Did you note down the call tree that got you there?

Jan
Tomasz Wroblewski
2013-Jan-25 09:07 UTC
Re: [PATCH v2] Fix scheduler crash after s3 resume
> As long as that doesn''t lead to other pools still come back > corrupted after resume. > > You may need to work this out with Juergen. > >Nods. My intention is to just offset the cpupool0 valid mask clear done on __cpu_disable().> > The change you''re about to partly revert was correcting one > fundamental mistake: bringing down a CPU at runtime is a > different thing than doing the same during suspend. In the > former case you indeed want all associations to it to be cut off, > whereas in the S3 case you want everything to come back at > resume the way it was before suspend. So I think you''re just > trying to revert to much of that original change. > > But again, Keir and Juergen (who collectively did that change > iirc) would be good for you to consult with. > >Understood, it''d certainly be preferable if this could be fixed leaving the original intent in (I tried but so far failed). From my limited understanding of reading the scheduler code, it is actually the vcpu_migrate() function called from cpu_disable_scheduler which substitutes new v->processor in the vcpu struct, which in turn causes the vcpu_wakes to stop happening on these partially downed cpus, and hence avoids the crash.> It certainly looks wrong for vcpu_wake() to be called on an offline > CPU in the first place (i.e. I''m getting the impression you''re trying > to cure symptoms rather than the root cause). Did you note down > the call tree that got you there? > >Here''s an example one, taken from unpatched suspend path - was pretty hard to get as for some reason I am getting a lot of serial output eaten even with sync_console. I also seen it from few other places, believe the window of opportunity for it to happen is before the enable_nonboot_boot_cpus() is called in enter_state(). If it survives past this point it won''t crash. (XEN) Finishing wakeup from ACPI S3 state. 
(XEN) ----[ Xen-4.3-unstable x86_64 debug=y Tainted: C ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff82c480122c1a>] vcpu_wake+0x3a/0x40a
(XEN) RFLAGS: 0000000000010082 CONTEXT: hypervisor
(XEN) rax: 00007d3b7fcf7580 rbx: ffff8300ba70f000 rcx: ffff8301020180c8
(XEN) rdx: 0000000000000000 rsi: 0000000000000040 rdi: ffff8300ba70f000
(XEN) rbp: ffff82c4802c7bf0 rsp: ffff82c4802c7bb0 r8: 0000000000000002
(XEN) r9: ffff83013a805380 r10: 00000009713faecf r11: 000000000000000a
(XEN) r12: ffff82c480308ae0 r13: ffff82c4802f2dc0 r14: 0000000000000246
(XEN) r15: 0000000000000082 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 00000000ba674000 cr2: 0000000000000060
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff82c4802c7bb0:
(XEN) ffff82c4802c7c00 0000000000000286 ffff83013a8053a8 ffff8300ba70f000
(XEN) 0000000000000000 ffff8300ba70f15c 0000000000000246 ffff82c4802c7e3c
(XEN) ffff82c4802c7c00 ffff82c48012337b ffff82c4802c7c20 ffff82c48015eaf6
(XEN) ffff830137b75000 0000000000000013 ffff82c4802c7c30 ffff82c48015eb75
(XEN) ffff82c4802c7c60 ffff82c4801067e5 ffff82c4802c7c60 ffff8300ba70f000
(XEN) 0000000000000000 ffff8300ba70f15c ffff82c4802c7c90 ffff82c480106a88
(XEN) ffff82c480308c80 ffff8300ba70f000 ffff82c480121590 00000009a5a50985
(XEN) ffff82c4802c7ca0 ffff82c480181af7 ffff82c4802c7cb0 ffff82c480121599
(XEN) ffff82c4802c7ce0 ffff82c4801274cf 0000000000000002 ffff8300ba70f060
(XEN) ffff82c480308c80 ffff830137beb240 ffff82c4802c7d30 ffff82c4801275cb
(XEN) ffff82c480308e28 0000000000000286 0000000000000000 ffff82c4802e0000
(XEN) ffff82c4802e0000 ffff82c4802c0000 fffffffffffffffd ffff82c4802c7e3c
(XEN) ffff82c4802c7d60 ffff82c480124b8e 0000000000000000 ffff82c4802655e0
(XEN) 0000000000000003 0000000000000100 ffff82c4802c7d70 ffff82c480124bdf
(XEN) ffff82c4802c7db0 ffff82c48012cd1f ffff82c4802c7da0 0000000000000000
(XEN) ffff82c48012c79c ffff82c4802c7e3c 0000000000000008 0000000000000000
(XEN) ffff82c4802c7e20 ffff82c480125a77 ffff82c4802c7e40 ffff82c48012cccb
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) ffff82c4802c7e00 ffff82c4802c0000 0000000000000000 0000000000000003
(XEN) ffff82c480308e90 00000000000026f0 ffff82c4802c7e40 ffff82c48012cbf8
(XEN) Xen call trace:
(XEN) [<ffff82c480122c1a>] vcpu_wake+0x3a/0x40a
(XEN) [<ffff82c48012337b>] vcpu_unblock+0x4b/0x4d
(XEN) [<ffff82c48015eaf6>] vcpu_kick+0x20/0x71
(XEN) [<ffff82c48015eb75>] vcpu_mark_events_pending+0x2e/0x39
(XEN) [<ffff82c4801067e5>] evtchn_set_pending+0xba/0x185
(XEN) [<ffff82c480106a88>] send_guest_vcpu_virq+0x62/0x7f
(XEN) [<ffff82c480181af7>] send_timer_event+0xe/0x10
(XEN) [<ffff82c480121599>] vcpu_singleshot_timer_fn+0x9/0xb
(XEN) [<ffff82c4801274cf>] execute_timer+0x4e/0x6c
(XEN) [<ffff82c4801275cb>] timer_softirq_action+0xde/0x206
(XEN) [<ffff82c480124b8e>] __do_softirq+0x8e/0x99
(XEN) [<ffff82c480124bdf>] process_pending_softirqs+0x46/0x48
(XEN) [<ffff82c48012cd1f>] rcu_barrier_action+0x54/0x7d
(XEN) [<ffff82c480125a77>] stop_machine_run+0x1b5/0x1fe
(XEN) [<ffff82c48012cbf8>] rcu_barrier+0x24/0x26
(XEN) [<ffff82c48019dae4>] enter_state+0x2cb/0x36b
(XEN) [<ffff82c48019db9c>] enter_state_action+0x18/0x24
(XEN) [<ffff82c480126ac6>] do_tasklet_work+0x8d/0xc7
(XEN) [<ffff82c480126e14>] do_tasklet+0x65/0x95
(XEN) [<ffff82c480159945>] idle_loop+0x63/0x6a
(XEN)
(XEN) Pagetable walk from 0000000000000060:
(XEN) L4[0x000] = 000000013a808063 ffffffffffffffff
(XEN) L3[0x000] = 000000013a807063 ffffffffffffffff
(XEN) L2[0x000] = 000000013a806063 ffffffffffffffff
(XEN) L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000060
(XEN) ****************************************
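The mechanism behind this trace, as described in the thread, can be modelled in a few lines: the scheduler's per-CPU data for offlined CPUs is gone, but v->processor still points at such a CPU unless cpu_disable_scheduler()/vcpu_migrate() moved the vCPU first, so vcpu_wake() dereferences freed memory. This is a simplified illustration, not Xen's real structures:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the vcpu_wake() crash: per-CPU scheduler data exists
 * only for online CPUs; a vCPU whose ->processor was never migrated
 * off a downed CPU makes any wake dereference a stale (freed) area -
 * here modelled as a NULL slot.  All names are illustrative. */
#define NR_CPUS 4

struct sched_percpu { int schedule_lock; };
struct vcpu { int processor; };

static struct sched_percpu *per_cpu_sched[NR_CPUS];  /* NULL = torn down */
static struct sched_percpu boot_cpu_sched;

/* Returns nonzero if waking this vCPU would touch freed per-CPU data. */
static int wake_would_crash(const struct vcpu *v)
{
    return per_cpu_sched[v->processor] == NULL;
}

/* Stand-in for the effect of vcpu_migrate() in cpu_disable_scheduler():
 * repoint the vCPU at a still-online processor. */
static void migrate_to(struct vcpu *v, int online_cpu)
{
    v->processor = online_cpu;
}
```

This is why reverting the suspend-path cpu_disable_scheduler() change makes the crash go away: with migration performed, v->processor points at an online CPU whose data is still allocated.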
>>> On 25.01.13 at 10:07, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> It certainly looks wrong for vcpu_wake() to be called on an offline
>> CPU in the first place (i.e. I'm getting the impression you're trying
>> to cure symptoms rather than the root cause). Did you note down
>> the call tree that got you there?
>>
> Here's an example one, taken from the unpatched suspend path - was
> pretty hard to get as for some reason I am getting a lot of serial
> output eaten even with sync_console. I have also seen it from a few
> other places; I believe the window of opportunity for it to happen is
> before enable_nonboot_cpus() is called in enter_state(). If it
> survives past this point it won't crash.
>
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) ----[ Xen-4.3-unstable x86_64 debug=y Tainted: C ]----
> (XEN) CPU: 0
> (XEN) RIP: e008:[<ffff82c480122c1a>] vcpu_wake+0x3a/0x40a
> [... register and stack dump identical to the trace quoted above ...]
> (XEN) Xen call trace:
> (XEN) [<ffff82c480122c1a>] vcpu_wake+0x3a/0x40a
> (XEN) [<ffff82c48012337b>] vcpu_unblock+0x4b/0x4d
> (XEN) [<ffff82c48015eaf6>] vcpu_kick+0x20/0x71
> (XEN) [<ffff82c48015eb75>] vcpu_mark_events_pending+0x2e/0x39
> (XEN) [<ffff82c4801067e5>] evtchn_set_pending+0xba/0x185
> (XEN) [<ffff82c480106a88>] send_guest_vcpu_virq+0x62/0x7f
> (XEN) [<ffff82c480181af7>] send_timer_event+0xe/0x10
> (XEN) [<ffff82c480121599>] vcpu_singleshot_timer_fn+0x9/0xb
> (XEN) [<ffff82c4801274cf>] execute_timer+0x4e/0x6c
> (XEN) [<ffff82c4801275cb>] timer_softirq_action+0xde/0x206
> (XEN) [<ffff82c480124b8e>] __do_softirq+0x8e/0x99
> (XEN) [<ffff82c480124bdf>] process_pending_softirqs+0x46/0x48
> (XEN) [<ffff82c48012cd1f>] rcu_barrier_action+0x54/0x7d
> (XEN) [<ffff82c480125a77>] stop_machine_run+0x1b5/0x1fe
> (XEN) [<ffff82c48012cbf8>] rcu_barrier+0x24/0x26

I think I had already raised
the question of the placement of this rcu_barrier() here, and the lack
of a counterpart in the suspend portion of the path. Keir?

Or should rcu_barrier_action() avoid calling process_pending_softirqs()
while still resuming, and instead call __do_softirq() with all but
RCU_SOFTIRQ masked (perhaps through a suitable wrapper, or
alternatively by open-coding its effect)?

Jan

> (XEN) [<ffff82c48019dae4>] enter_state+0x2cb/0x36b
> (XEN) [<ffff82c48019db9c>] enter_state_action+0x18/0x24
> (XEN) [<ffff82c480126ac6>] do_tasklet_work+0x8d/0xc7
> (XEN) [<ffff82c480126e14>] do_tasklet+0x65/0x95
> (XEN) [<ffff82c480159945>] idle_loop+0x63/0x6a
> (XEN)
> (XEN) Pagetable walk from 0000000000000060:
> (XEN) L4[0x000] = 000000013a808063 ffffffffffffffff
> (XEN) L3[0x000] = 000000013a807063 ffffffffffffffff
> (XEN) L2[0x000] = 000000013a806063 ffffffffffffffff
> (XEN) L1[0x000] = 0000000000000000 ffffffffffffffff
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0000]
> (XEN) Faulting linear address: 0000000000000060
> (XEN) ****************************************
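Jan's masking idea - service only the RCU softirq while resuming, leaving e.g. the timer softirq (which leads into vcpu_wake()) pending - can be sketched as a small wrapper. The bit numbers, names, and handler stubs below are illustrative, not Xen's actual definitions:

```c
#include <assert.h>

/* Sketch of a masked softirq pass: during resume, only softirqs whose
 * bit is set in allowed_mask get serviced; everything else stays
 * pending until the non-boot CPUs are back up.  Names and bit
 * assignments are hypothetical stand-ins for Xen's softirq machinery. */
enum { TIMER_SOFTIRQ = 0, RCU_SOFTIRQ = 1, NR_SOFTIRQS = 2 };

static unsigned int softirq_pending;
static int handled[NR_SOFTIRQS];

static void do_softirq_masked(unsigned int allowed_mask)
{
    for (int nr = 0; nr < NR_SOFTIRQS; nr++) {
        unsigned int bit = 1u << nr;
        if ((softirq_pending & bit) && (allowed_mask & bit)) {
            softirq_pending &= ~bit;   /* clear before running handler */
            handled[nr]++;             /* stand-in for the real handler */
        }
    }
}
```

rcu_barrier_action() would then call something like do_softirq_masked(1u << RCU_SOFTIRQ) while system_state indicates resume, so the barrier can still complete without letting timer-driven wakes reach offline CPUs.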
Tomasz Wroblewski
2013-Jan-25 09:45 UTC
Re: [PATCH v2] Fix scheduler crash after s3 resume
> I think I had already raised the question of the placement of
> this rcu_barrier() here, and the lack of a counterpart in the
> suspend portion of the path. Keir? Or should
> rcu_barrier_action() avoid calling process_pending_softirqs()
> while still resuming, and instead call __do_softirq() with all but
> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
> or alternatively by open-coding its effect)?
>
Though I recall these vcpu_wake crashes happen also from entry points in
enter_state() other than rcu_barrier, so I don't think removing that
helps much. I was just unable to get a proper log of them today due to
most of them being cut in half. Will try a bit more.

My belief is that as long as vcpu_migrate is not called in
cpu_disable_scheduler, vcpu->processor shall continue to point to an
offline cpu, which will crash if vcpu_wake is called for that vcpu. If
vcpu_migrate is called, then vcpu_wake will still be called with some
frequency, but since vcpu->processor shall point to an online cpu, it
won't crash. So likely avoiding the wakes here completely is not the
goal, just the offline ones.
>>> On 25.01.13 at 10:45, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> Though I recall these vcpu_wake crashes happen also from other entry
> points in enter_state but rcu_barrier, so I don't think removing that
> helps much. Just was unable to get a proper log of them today due to
> most of them being cut in half. Will try a bit more.

In which case making __do_softirq() itself honor being in the
suspend/resume path might still be an option.

> My belief is that as long as vcpu_migrate is not called in
> cpu_disable_scheduler, the vcpu->processor shall continue to point to
> offline cpu. Which will crash if the vcpu_wake is called for that vcpu.
> If vcpu_migrate is called, then vcpu_wake will still be called with some
> frequency but since vcpu->processor shall point to online cpu, it
> won't crash. So likely avoiding the wakes here completely is not the
> goal, just the offline ones.

But you neglect the fact that waking vCPU-s at this point is
unnecessary anyway (they have nowhere to run on).

Jan
Tomasz Wroblewski
2013-Jan-25 10:18 UTC
Re: [PATCH v2] Fix scheduler crash after s3 resume
> But you neglect the fact that waking vCPU-s at this point is
> unnecessary anyway (they have nowhere to run on).
>
Hm, can't they still run on cpu 0? Also, I did earlier try to patch this
more primitively by adding a check to vcpu_wake whether the cpu is in
cpu_online_map, and whilst it stopped the crashing, I observed some
unexpected hangs w/o any serial output - sometimes just dom0 would hang,
sometimes xen would also stop being responsive on serial. So I figured
that these missed vcpu wakes still account for something.
Am 25.01.2013 11:15, schrieb Jan Beulich:
> In which case making __do_softirq() itself honor being in the
> suspend/resume path might still be an option.
>
> But you neglect the fact that waking vCPU-s at this point is
> unnecessary anyway (they have nowhere to run on).

What about adding a global scheduler_disable() in freeze_domains() and a
scheduler_enable() in thaw_domains() which will switch scheduler locking
to a global lock (or disable it at all?). This should solve all problems
without any complex changes of current behavior.

Juergen

--
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6          Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions  e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                 Internet: ts.fujitsu.com
D-80807 Muenchen              Company details: ts.fujitsu.com/imprint.html
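Juergen's proposal amounts to repointing every CPU's scheduler lock at one global lock while domains are frozen, so that freed per-CPU lock storage is never dereferenced. A minimal sketch of that switch follows; scheduler_disable()/scheduler_enable() and the lock layout are hypothetical, simplified from whatever the real patch would look like:

```c
#include <assert.h>

/* Sketch of the global-lock switch: while domains are frozen, every
 * CPU's scheduler-lock pointer refers to one always-allocated global
 * lock, so vcpu_wake() never chases a lock address inside a freed
 * per-CPU area.  All names here are illustrative. */
#define NR_CPUS 4

typedef struct { int held; } spinlock_t;

static spinlock_t global_sched_lock;
static spinlock_t percpu_sched_lock[NR_CPUS];
static spinlock_t *sched_lock_ptr[NR_CPUS];

static void scheduler_disable(void)   /* would be called from freeze_domains() */
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sched_lock_ptr[cpu] = &global_sched_lock;
}

static void scheduler_enable(void)    /* would be called from thaw_domains() */
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sched_lock_ptr[cpu] = &percpu_sched_lock[cpu];
}
```

The trade-off, as the rest of the thread shows, is that this only removes the locking crash; it does nothing about other code (e.g. cpufreq) that still dereferences freed per-CPU state.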
Tomasz Wroblewski
2013-Jan-25 10:29 UTC
Re: [PATCH v2] Fix scheduler crash after s3 resume
On 25/01/13 11:23, Juergen Gross wrote:
> What about adding a global scheduler_disable() in freeze_domains() and a
> scheduler_enable() in thaw_domains() which will switch scheduler locking
> to a global lock (or disable it at all?). This should solve all problems
> without any complex changes of current behavior.
>
Yes, I'd be very happy to test such a patch! Probably not familiar
enough with this to be able to write the disable/enable logic myself
though.
>>> On 25.01.13 at 11:18, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> But you neglect the fact that waking vCPU-s at this point is
>> unnecessary anyway (they have nowhere to run on).
>>
> Hm, can't they still run on cpu 0?

They could, but there's no point in even trying to before having
thawed domains.

> Also, I did earlier try to patch this
> more primitively by adding a check to vcpu_wake whether the cpu is in
> cpu_online_map, and whilst it stopped the crashing, I observed some
> unexpected hangs w/o any serial output - sometimes just the dom0 would
> hang, sometimes also xen would stop being responsive on serial. So I
> figured out that these missed vcpu wakes account for something still.

Oh, yes, of course, keeping vcpu_wake() from doing its job like this
is surely calling for other problems.

Jan
>>> On 25.01.13 at 11:23, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> Am 25.01.2013 11:15, schrieb Jan Beulich:
> [...]
> What about adding a global scheduler_disable() in freeze_domains() and a
> scheduler_enable() in thaw_domains() which will switch scheduler locking
> to a global lock (or disable it at all?). This should solve all problems
> without any complex changes of current behavior.

I don't see how this would address the so far described
shortcomings.

Jan
Am 25.01.2013 11:31, schrieb Jan Beulich:
> [...]
>> What about adding a global scheduler_disable() in freeze_domains() and a
>> scheduler_enable() in thaw_domains() which will switch scheduler locking
>> to a global lock (or disable it at all?). This should solve all problems
>> without any complex changes of current behavior.
>
> I don't see how this would address the so far described
> shortcomings.

The crash happens due to an access to the scheduler percpu area which
isn't allocated at the moment. The accessed element is the address of
the scheduler lock for this cpu. Disabling the percpu locking scheme of
the scheduler while the non-boot cpus are offline will avoid the crash.

Juergen
>>> On 25.01.13 at 11:35, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> [...]
>>> What about adding a global scheduler_disable() in freeze_domains() and a
>>> scheduler_enable() in thaw_domains() which will switch scheduler locking
>>> to a global lock (or disable it at all?). This should solve all problems
>>> without any complex changes of current behavior.
>>
>> I don't see how this would address the so far described
>> shortcomings.
>
> The crash happens due to an access to the scheduler percpu area which
> isn't allocated at the moment. The accessed element is the address of
> the scheduler lock for this cpu. Disabling the percpu locking scheme of
> the scheduler while the non-boot cpus are offline will avoid the crash.

Ah, okay. But that wouldn't prevent other bad effects that could
result from vCPU-s pointing to offline pCPU-s. Hence I think such
a solution, even if sufficient for now, would set us up for future
similar (and similarly hard to debug) issues.

Jan
Am 25.01.2013 11:40, schrieb Jan Beulich:
> [...]
>> The crash happens due to an access to the scheduler percpu area which
>> isn't allocated at the moment. The accessed element is the address of
>> the scheduler lock for this cpu. Disabling the percpu locking scheme of
>> the scheduler while the non-boot cpus are offline will avoid the crash.
>
> Ah, okay. But that wouldn't prevent other bad effects that could
> result from vCPU-s pointing to offline pCPU-s. Hence I think such
> a solution, even if sufficient for now, would set us up for future
> similar (and similarly hard to debug) issues.

To avoid this problem you would have to change the suspend logic, I
think. We could take the cpus just physically offline without removing
all the state information from the system. During resume the old state
and percpu areas would just be reused. While this would be a clean
solution, it's not a really small task to do...

Juergen
Tomasz Wroblewski
2013-Jan-25 11:56 UTC
Re: [PATCH v2] Fix scheduler crash after s3 resume
> The crash happens due to an access to the scheduler percpu area which
> isn't allocated at the moment. The accessed element is the address of
> the scheduler lock for this cpu. Disabling the percpu locking scheme of
> the scheduler while the non-boot cpus are offline will avoid the crash.
>
Ok, so I tried this approach (by making the locking in vcpu_wake
conditional on system_state), and whilst it stopped the vcpu_wake crash,
I traded it for a crash in acpi_cpufreq_target:

(XEN) ----[ Xen-4.3-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU: 3
(XEN) RIP: e008:[<ffff82c4801a0594>] acpi_cpufreq_target+0x165/0x33b
(XEN) RFLAGS: 0000000000010293 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: ffff830137bc7300 rcx: 0000000000000000
(XEN) rdx: 0000000000000009 rsi: ffff82c480265460 rdi: ffff830137bd7d60
(XEN) rbp: ffff830137bd7db0 rsp: ffff830137bd7d30 r8: 0000000000000004
(XEN) r9: 00000000fffffffe r10: 0000000000000009 r11: 0000000000000000
(XEN) r12: ffff830137bc7c70 r13: ffff8301025444f8 r14: ffff830137bc7c70
(XEN) r15: 0000000001b5b14c cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 00000000ba674000 cr2: 000000000000004c
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) Xen stack trace from rsp=ffff830137bd7d30:
(XEN) 000000008017d626 0000000000000009 00000009000000fb ffff830100000001
(XEN) ffff830137bd7d60 0000080000000199 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 ffffffff37bd7da0 00000000ffffffea
(XEN) ffff830137bc7c70 00000000002936c8 0000000006d9e30a 0000000001b5b14c
(XEN) ffff830137bd7df0 ffff82c4801414ee ffff830137bc7c70 0000000000000003
(XEN) ffff830137bd7df0 0000000000000008 0000000000000008 ffff830102ae1340
(XEN) ffff830137bd7e50 ffff82c480140815 ffff8301141624a0 002936c800000286
(XEN) ffff82c480308dc0 ffff830137bc7c70 0000000000000003 ffff830102ae1380
(XEN) ffff830137bebb50 ffff830137bebc00 0000000000000010 0000000000000030
(XEN) ffff830137bd7e70 ffff82c480140a2b ffff830137bd7e70 0000001548c205b8
(XEN) ffff830137bd7ef0 ffff82c4801a31da 0000000000000002
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

(call graph sadly got eaten)

which corresponds to the following lines in cpufreq.c:

    freqs.old = perf->states[perf->state].core_frequency * 1000;
    freqs.new = data->freq_table[next_state].frequency;

ffff82c4801a058d:  8b 55 94              mov    -0x6c(%rbp),%edx
ffff82c4801a0590:  48 8b 43 08           mov    0x8(%rbx),%rax
ffff82c4801a0594:  8b 44 d0 04           mov    0x4(%rax,%rdx,8),%eax
ffff82c4801a0598:  89 45 8c              mov    %eax,-0x74(%rbp)
ffff82c4801a059b:  48 c7 c0 00 80 ff ff  mov    $0xffffffffffff8000,%rax
ffff82c4801a05a2:  48 21 e0              and    %rsp,%rax

which I guess crashes because either freq_table or data is freed at this
point (indeed it seems the cpufreq driver has some cpu up/down logic
which frees it). Given this is not even the first place in
acpi_cpufreq_target where this is accessed, it looks like the cpu got
torn down halfway through this function... I suspect there are likely to
be more sites affected by this.

I also tried Jan's suggestion of making do_softirq skip its job if we
are resuming; that causes a hang in rcu_barrier(). Adding another
resume-conditional rcu_barrier() made it progress further, but it
crashed elsewhere (don't remember where exactly - this approach looked a
bit like a dead end, so I abandoned it quickly).

So I still don't have a better solution than the revert of the
cpu_disable_scheduler() hunk.
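The use-after-free pattern read out of the disassembly above - per-CPU cpufreq driver data (including freq_table) freed by the CPU-down path while a frequency change is still in flight - can be sketched like this. The structure names mirror, but heavily simplify, the real cpufreq data; the NULL-check guard is a hypothetical mitigation, not a proposed patch:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the acpi_cpufreq_target crash: drv_data[cpu] is freed
 * when the CPU is torn down (modelled as setting the slot to NULL), so
 * data->freq_table[next_state].frequency reads freed memory.  A
 * defensive bail-out is sketched; all names are illustrative. */
struct cpufreq_frequency_table { unsigned int frequency; };
struct acpi_cpufreq_data { struct cpufreq_frequency_table *freq_table; };

static struct acpi_cpufreq_data *drv_data[4];  /* per-CPU; NULL = torn down */

/* Returns the target frequency, or 0 if this CPU's data is already gone. */
static unsigned int target_freq_sketch(int cpu, unsigned int next_state)
{
    struct acpi_cpufreq_data *data = drv_data[cpu];

    if (data == NULL || data->freq_table == NULL)
        return 0;                     /* CPU mid-teardown: refuse the request */
    return data->freq_table[next_state].frequency;
}
```

As the message notes, this is only one of several dereferences in acpi_cpufreq_target, which is why point fixes like this look less attractive than keeping the per-CPU state alive across suspend.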
>>> On 25.01.13 at 12:56, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
> I also tried Jan's suggestion of making do_softirq skip its job if we are
> resuming, that causes a hang in rcu_barrier(), adding another resume
> conditional rcu_barrier() made it progress further but crash elsewhere
> (don't remember where exactly, this approach looked a bit like dead end
> so i abandoned it quickly)

I didn't say exactly that - for avoiding the hang, you still need to
service the RCU softirq.

Jan
Tomasz Wroblewski
2013-Jan-25 13:58 UTC
Re: [PATCH v2] Fix scheduler crash after s3 resume
On 25/01/13 13:27, Jan Beulich wrote:
>>>> On 25.01.13 at 12:56, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>> I also tried Jan's suggestion of making do_softirq skip its job if we are
>> resuming, that causes a hang in rcu_barrier(), adding another resume
>> conditional rcu_barrier() made it progress further but crash elsewhere
>> (don't remember where exactly, this approach looked a bit like dead end
>> so i abandoned it quickly)
>
> I didn't say exactly that - for avoiding the hang, you still need to
> service the RCU softirq.
>
Ah, sorry. Ok, this approach seems to work for avoiding the vcpu_wake
crash, although I am still getting the occasional crash in
acpi_cpufreq_target, same as when I tried Juergen's suggestion. I'm not
getting the call stack captured for that, so I'm not sure where it's
called from, but will try to pursue it a bit.