Ian Campbell
2012-Oct-26 12:59 UTC
#599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
Hi all, I''ve BCC''d a number of people who have reported seeing this bug at various times in the past. If you can still repro I''d appreciate it if you could give the patch in http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a go and report back success/failure and the output of the debugging messages produced. Thanks, Ian. -- Ian Campbell Current Noise: Death - Evil Dead Executive ability is prominent in your make-up. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Mauro
2012-Oct-26 18:25 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 26 October 2012 14:59, Ian Campbell <ijc@hellion.org.uk> wrote:> Hi all, > > I''ve BCC''d a number of people who have reported seeing this bug at > various times in the past. > > If you can still repro I''d appreciate it if you could give the patch in > http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a go > and report back success/failure and the output of the debugging messages > produced.Is that patch for amd64 architectures?
Ian Campbell
2012-Oct-26 18:40 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On Fri, 2012-10-26 at 19:25 +0100, Mauro wrote:> On 26 October 2012 14:59, Ian Campbell <ijc@hellion.org.uk> wrote: > > Hi all, > > > > I''ve BCC''d a number of people who have reported seeing this bug at > > various times in the past. > > > > If you can still repro I''d appreciate it if you could give the patch in > > http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a go > > and report back success/failure and the output of the debugging messages > > produced. > > Is that patch for amd64 architectures?It is for 32 and 64 bit x86, so yes. BTW, you were the original recipient of this patch in the thread linked above. Ian. -- Ian Campbell "I''d love to go out with you, but I''m having all my plants neutered."
<Philippe.Simonet@swisscom.com>
2012-Nov-07 10:10 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
Hi Ian i compiled a patched hypervisor for Mauro, it is running since many days and the overflow occured, without clock jumps> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 > now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 > plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 > tsc_stamp=e3839fcb0273(below is the complete xm dmesg output) did that help you ? do you need more info ? thanks and regards Philippe> -----Original Message----- > From: Mauro [mailto:mrsanna1@gmail.com] > Sent: Wednesday, November 07, 2012 10:12 AM > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE > Subject: Re: [Xen-devel] [Xen-users] Re: Xen 4 TSC problems > > Hello, no news until now there aren''t clock jumps. > Here is xm dmesg: > > xm dmesg > (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) (gcc > version 4.4.5 (Debian 4.4.5-8) ) Mon Oct 29 14:42:12 CET 2012 > (XEN) Bootloader: GRUB 1.98+20100804-14+squeeze1 > (XEN) Command line: placeholder dom0_mem=3072M loglvl=warning > guest_loglvl=warning > (XEN) Video information: > (XEN) VGA is text mode 80x25, font 8x16 > (XEN) VBE/DDC methods: V2; EDID transfer time: 2 seconds > (XEN) Disc information: > (XEN) Found 2 MBR signatures > (XEN) Found 2 EDD information structures > (XEN) Xen-e820 RAM map: > (XEN) 0000000000000000 - 000000000009f400 (usable) > (XEN) 000000000009f400 - 00000000000a0000 (reserved) > (XEN) 00000000000f0000 - 0000000000100000 (reserved) > (XEN) 0000000000100000 - 00000000cfd43000 (usable) > (XEN) 00000000cfd43000 - 00000000cfd4c000 (ACPI data) > (XEN) 00000000cfd4c000 - 00000000cfd4d000 (usable) > (XEN) 00000000cfd4d000 - 00000000d0000000 (reserved) > (XEN) 00000000e0000000 - 00000000f0000000 (reserved) > (XEN) 00000000fec00000 - 00000000fed00000 (reserved) > (XEN) 00000000fee00000 - 00000000fee10000 (reserved) > (XEN) 00000000ffc00000 - 0000000100000000 (reserved) > (XEN) 0000000100000000 - 000000102ffff000 (usable) > (XEN) ACPI: RSDP 000F4F20, 0024 (r2 HP ) > (XEN) ACPI: XSDT CFD43900, 007C (r1 HP ProLiant 2 162E) > (XEN) ACPI: FACP CFD439C0, 00F4 (r3 HP ProLiant 2 162E) > (XEN) ACPI: DSDT CFD43AC0, 30C9 (r1 HP DSDT 1 INTL 20030228) > (XEN) ACPI: FACS CFD43100, 0040 > (XEN) ACPI: SPCR CFD43140, 0050 (r1 HP SPCRRBSU 1 162E) > (XEN) ACPI: MCFG CFD431C0, 003C (r1 HP ProLiant 1 0) > (XEN) ACPI: HPET CFD43200, 0038 (r1 HP ProLiant 2 162E) > (XEN) ACPI: FFFF CFD43240, 0064 (r2 HP P61 2 162E) > (XEN) ACPI: SPMI CFD432C0, 0040 (r5 HP ProLiant 1 162E) > (XEN) ACPI: ERST CFD43300, 01D0 (r1 HP ProLiant 1 162E) > (XEN) ACPI: APIC CFD43500, 0176 (r1 HP ProLiant 2 0) > (XEN) ACPI: FFFF CFD43680, 0176 (r1 HP ProLiant 1 162E) > (XEN) ACPI: BERT CFD43800, 0030 (r1 HP ProLiant 1 162E) > (XEN) ACPI: HEST CFD43840, 00BC (r1 HP ProLiant 1 162E) > (XEN) System RAM: 65532MB (67105672kB) > (XEN) Domain heap initialised > (XEN) Processor #0 6:15 APIC version 20 > (XEN) Processor #8 6:15 APIC version 20 > (XEN) Processor #16 6:15 APIC version 20 > (XEN) Processor #24 6:15 APIC version 20 > (XEN) Processor #1 6:15 APIC version 20 > (XEN) Processor #9 6:15 APIC version 20 > (XEN) Processor #17 6:15 APIC version 20 > (XEN) Processor #25 6:15 APIC version 20 > (XEN) Processor #2 6:15 APIC version 20 > (XEN) Processor #10 6:15 APIC version 20 > (XEN) Processor #18 6:15 APIC version 20 > (XEN) Processor #26 6:15 APIC version 20 > (XEN) Processor #3 6:15 APIC version 20 > (XEN) Processor #11 6:15 APIC version 20 > (XEN) Processor #19 6:15 APIC version 20 > (XEN) Processor #27 6:15 APIC version 20 > (XEN) IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23 > (XEN) IOAPIC[1]: apic_id 2, version 32, address 0xfec80000, GSI 24-47 > (XEN) IOAPIC[2]: apic_id 3, version 32, address 0xfec81000, GSI 48-71 > (XEN) IOAPIC[3]: apic_id 4, version 32, address 0xfec81800, GSI 72-95 > (XEN) Enabling APIC mode: Phys. Using 4 I/O APICs > (XEN) Using scheduler: SMP Credit Scheduler (credit) > (XEN) Detected 2400.128 MHz processor. > (XEN) Initing memory sharing. > (XEN) VMX: Supported advanced features: > (XEN) - APIC MMIO access virtualisation > (XEN) - APIC TPR shadow > (XEN) - Virtual NMI > (XEN) - MSR direct-access bitmap > (XEN) HVM: ASIDs disabled. > (XEN) HVM: VMX enabled > (XEN) I/O virtualisation disabled > (XEN) Total of 16 processors activated. > (XEN) ENABLING IO-APIC IRQs > (XEN) -> Using new ACK method > (XEN) checking TSC synchronization across 16 CPUs: > (XEN) CPU#14 had 3 usecs TSC skew, fixed it up. > (XEN) Platform timer is 14.318MHz HPET > (XEN) Allocated console ring of 32 KiB. > (XEN) Brought up 16 CPUs > (XEN) *** LOADING DOMAIN 0 *** > (XEN) Xen kernel: 64-bit, lsb, compat32 > (XEN) Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1708000 > (XEN) PHYSICAL MEMORY ARRANGEMENT: > (XEN) Dom0 alloc.: 000000083c000000->0000000840000000 (770048 pages > to be allocated) > (XEN) VIRTUAL MEMORY ARRANGEMENT: > (XEN) Loaded kernel: ffffffff81000000->ffffffff81708000 > (XEN) Init. ramdisk: ffffffff81708000->ffffffff81efb000 > (XEN) Phys-Mach map: ffffffff81efb000->ffffffff824fb000 > (XEN) Start info: ffffffff824fb000->ffffffff824fb4b4 > (XEN) Page tables: ffffffff824fc000->ffffffff82513000 > (XEN) Boot stack: ffffffff82513000->ffffffff82514000 > (XEN) TOTAL: ffffffff80000000->ffffffff82800000 > (XEN) ENTRY ADDRESS: ffffffff81531200 > (XEN) Dom0 has maximum 16 VCPUs > (XEN) Scrubbing Free RAM: > ..................................................................................................................... > ..................................................................................................................... > ..................................................................................................................... > ..................................................................................................................... > ..................................................................................................................... > ................................done. > (XEN) Xen trace buffers: disabled > (XEN) Std. Loglevel: Errors and warnings > (XEN) Guest Loglevel: Errors and warnings > (XEN) Xen is relinquishing VGA console. > (XEN) *** Serial input -> DOM0 (type ''CTRL-a'' three times to switch input to > Xen) > (XEN) Freed 176kB init memory. > (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 > now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 > plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 > tsc_stamp=e3839fcb0273 > >> -----Original Message----- > From: xen-devel-bounces@lists.xen.org [mailto:xen-devel- > bounces@lists.xen.org] On Behalf Of Ian Campbell > Sent: Friday, October 26, 2012 3:00 PM > To: xen-devel@lists.xen.org > Cc: 599161@bugs.debian.org > Subject: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 > minutes" bug. > > Hi all, > > I''ve BCC''d a number of people who have reported seeing this bug at various > times in the past. > > If you can still repro I''d appreciate it if you could give the patch in > http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a > go and report back success/failure and the output of the debugging > messages produced. > > Thanks, > Ian. > > -- > Ian Campbell > Current Noise: Death - Evil Dead > > Executive ability is prominent in your make-up.
Ian Campbell
2012-Nov-07 13:22 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On Wed, 2012-11-07 at 10:10 +0000, Philippe.Simonet@swisscom.com wrote:> Hi IanThanks for doing this test.> i compiled a patched hypervisor for Mauro, it is running since many > days and the overflow occured, without clock jumpsSo just to be clear you saw this logging occur *without* the 50 minute jump in time? That''s good!> > (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 > > now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 > > plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 > > tsc_stamp=e3839fcb0273 > > (below is the complete xm dmesg output) > > did that help you ? do you need more info ?I''ll leave this to Keir (who wrote the debugging patch) to answer but it looks to me like it should be useful! Thanks again. Ian.> thanks and regards > > Philippe > > > > -----Original Message----- > > From: Mauro [mailto:mrsanna1@gmail.com] > > Sent: Wednesday, November 07, 2012 10:12 AM > > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE > > Subject: Re: [Xen-devel] [Xen-users] Re: Xen 4 TSC problems > > > > Hello, no news until now there aren''t clock jumps. > > Here is xm dmesg: > > > > xm dmesg > > (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) (gcc > > version 4.4.5 (Debian 4.4.5-8) ) Mon Oct 29 14:42:12 CET 2012 > > (XEN) Bootloader: GRUB 1.98+20100804-14+squeeze1 > > (XEN) Command line: placeholder dom0_mem=3072M loglvl=warning > > guest_loglvl=warning > > (XEN) Video information: > > (XEN) VGA is text mode 80x25, font 8x16 > > (XEN) VBE/DDC methods: V2; EDID transfer time: 2 seconds > > (XEN) Disc information: > > (XEN) Found 2 MBR signatures > > (XEN) Found 2 EDD information structures > > (XEN) Xen-e820 RAM map: > > (XEN) 0000000000000000 - 000000000009f400 (usable) > > (XEN) 000000000009f400 - 00000000000a0000 (reserved) > > (XEN) 00000000000f0000 - 0000000000100000 (reserved) > > (XEN) 0000000000100000 - 00000000cfd43000 (usable) > > (XEN) 00000000cfd43000 - 00000000cfd4c000 (ACPI data) > > (XEN) 00000000cfd4c000 - 00000000cfd4d000 (usable) > > (XEN) 00000000cfd4d000 - 00000000d0000000 (reserved) > > (XEN) 00000000e0000000 - 00000000f0000000 (reserved) > > (XEN) 00000000fec00000 - 00000000fed00000 (reserved) > > (XEN) 00000000fee00000 - 00000000fee10000 (reserved) > > (XEN) 00000000ffc00000 - 0000000100000000 (reserved) > > (XEN) 0000000100000000 - 000000102ffff000 (usable) > > (XEN) ACPI: RSDP 000F4F20, 0024 (r2 HP ) > > (XEN) ACPI: XSDT CFD43900, 007C (r1 HP ProLiant 2 162E) > > (XEN) ACPI: FACP CFD439C0, 00F4 (r3 HP ProLiant 2 162E) > > (XEN) ACPI: DSDT CFD43AC0, 30C9 (r1 HP DSDT 1 INTL 20030228) > > (XEN) ACPI: FACS CFD43100, 0040 > > (XEN) ACPI: SPCR CFD43140, 0050 (r1 HP SPCRRBSU 1 162E) > > (XEN) ACPI: MCFG CFD431C0, 003C (r1 HP ProLiant 1 0) > > (XEN) ACPI: HPET CFD43200, 0038 (r1 HP ProLiant 2 162E) > > (XEN) ACPI: FFFF CFD43240, 0064 (r2 HP P61 2 162E) > > (XEN) ACPI: SPMI CFD432C0, 0040 (r5 HP ProLiant 1 162E) > > (XEN) ACPI: ERST CFD43300, 01D0 (r1 HP ProLiant 1 162E) > > (XEN) ACPI: APIC CFD43500, 0176 (r1 HP ProLiant 2 0) > > (XEN) ACPI: FFFF CFD43680, 0176 (r1 HP ProLiant 1 162E) > > (XEN) ACPI: BERT CFD43800, 0030 (r1 HP ProLiant 1 162E) > > (XEN) ACPI: HEST CFD43840, 00BC (r1 HP ProLiant 1 162E) > > (XEN) System RAM: 65532MB (67105672kB) > > (XEN) Domain heap initialised > > (XEN) Processor #0 6:15 APIC version 20 > > (XEN) Processor #8 6:15 APIC version 20 > > (XEN) Processor #16 6:15 APIC version 20 > > (XEN) Processor #24 6:15 APIC version 20 > > (XEN) Processor #1 6:15 APIC version 20 > > (XEN) Processor #9 6:15 APIC version 20 > > (XEN) Processor #17 6:15 APIC version 20 > > (XEN) Processor #25 6:15 APIC version 20 > > (XEN) Processor #2 6:15 APIC version 20 > > (XEN) Processor #10 6:15 APIC version 20 > > (XEN) Processor #18 6:15 APIC version 20 > > (XEN) Processor #26 6:15 APIC version 20 > > (XEN) Processor #3 6:15 APIC version 20 > > (XEN) Processor #11 6:15 APIC version 20 > > (XEN) Processor #19 6:15 APIC version 20 > > (XEN) Processor #27 6:15 APIC version 20 > > (XEN) IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23 > > (XEN) IOAPIC[1]: apic_id 2, version 32, address 0xfec80000, GSI 24-47 > > (XEN) IOAPIC[2]: apic_id 3, version 32, address 0xfec81000, GSI 48-71 > > (XEN) IOAPIC[3]: apic_id 4, version 32, address 0xfec81800, GSI 72-95 > > (XEN) Enabling APIC mode: Phys. Using 4 I/O APICs > > (XEN) Using scheduler: SMP Credit Scheduler (credit) > > (XEN) Detected 2400.128 MHz processor. > > (XEN) Initing memory sharing. > > (XEN) VMX: Supported advanced features: > > (XEN) - APIC MMIO access virtualisation > > (XEN) - APIC TPR shadow > > (XEN) - Virtual NMI > > (XEN) - MSR direct-access bitmap > > (XEN) HVM: ASIDs disabled. > > (XEN) HVM: VMX enabled > > (XEN) I/O virtualisation disabled > > (XEN) Total of 16 processors activated. > > (XEN) ENABLING IO-APIC IRQs > > (XEN) -> Using new ACK method > > (XEN) checking TSC synchronization across 16 CPUs: > > (XEN) CPU#14 had 3 usecs TSC skew, fixed it up. > > (XEN) Platform timer is 14.318MHz HPET > > (XEN) Allocated console ring of 32 KiB. > > (XEN) Brought up 16 CPUs > > (XEN) *** LOADING DOMAIN 0 *** > > (XEN) Xen kernel: 64-bit, lsb, compat32 > > (XEN) Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1708000 > > (XEN) PHYSICAL MEMORY ARRANGEMENT: > > (XEN) Dom0 alloc.: 000000083c000000->0000000840000000 (770048 pages > > to be allocated) > > (XEN) VIRTUAL MEMORY ARRANGEMENT: > > (XEN) Loaded kernel: ffffffff81000000->ffffffff81708000 > > (XEN) Init. ramdisk: ffffffff81708000->ffffffff81efb000 > > (XEN) Phys-Mach map: ffffffff81efb000->ffffffff824fb000 > > (XEN) Start info: ffffffff824fb000->ffffffff824fb4b4 > > (XEN) Page tables: ffffffff824fc000->ffffffff82513000 > > (XEN) Boot stack: ffffffff82513000->ffffffff82514000 > > (XEN) TOTAL: ffffffff80000000->ffffffff82800000 > > (XEN) ENTRY ADDRESS: ffffffff81531200 > > (XEN) Dom0 has maximum 16 VCPUs > > (XEN) Scrubbing Free RAM: > > ..................................................................................................................... > > ..................................................................................................................... > > ..................................................................................................................... > > ..................................................................................................................... > > ..................................................................................................................... > > ................................done. > > (XEN) Xen trace buffers: disabled > > (XEN) Std. Loglevel: Errors and warnings > > (XEN) Guest Loglevel: Errors and warnings > > (XEN) Xen is relinquishing VGA console. > > (XEN) *** Serial input -> DOM0 (type ''CTRL-a'' three times to switch input to > > Xen) > > (XEN) Freed 176kB init memory. > > (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 > > now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 > > plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 > > tsc_stamp=e3839fcb0273 > > > > > > > > > > > -----Original Message----- > > From: xen-devel-bounces@lists.xen.org [mailto:xen-devel- > > bounces@lists.xen.org] On Behalf Of Ian Campbell > > Sent: Friday, October 26, 2012 3:00 PM > > To: xen-devel@lists.xen.org > > Cc: 599161@bugs.debian.org > > Subject: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 > > minutes" bug. > > > > Hi all, > > > > I''ve BCC''d a number of people who have reported seeing this bug at various > > times in the past. > > > > If you can still repro I''d appreciate it if you could give the patch in > > http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a > > go and report back success/failure and the output of the debugging > > messages produced. > > > > Thanks, > > Ian. > > > > -- > > Ian Campbell > > Current Noise: Death - Evil Dead > > > > Executive ability is prominent in your make-up.
Jan Beulich
2012-Nov-07 17:07 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
>>> On 07.11.12 at 11:10, <Philippe.Simonet@swisscom.com> wrote: > i compiled a patched hypervisor for Mauro, it is running since many days > and the overflow occured, > without clock jumps > >> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 >> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 >> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 >> tsc_stamp=e3839fcb0273i.e. we have (in order of time) plt_wrap=5ece12d09306 now=5ece12d16292 plt_now=5ece12d34128 which is exactly the inverse order of how things should be (now not necessarily being in the middle). Nor should plt_now and plt_wrap be that close together. So far I have no idea how this can be explained. Jan
Keir Fraser
2012-Nov-07 17:40 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote:>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 >>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 >>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 >>> tsc_stamp=e3839fcb0273 >> >> (below is the complete xm dmesg output) >> >> did that help you ? do you need more info ? > > I''ll leave this to Keir (who wrote the debugging patch) to answer but it > looks to me like it should be useful!I''m scratching my head. plt_wrap is earlier than plt_now, which should be impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That seems very very improbable! I wonder whether the overflow handling should just be removed, or made conditional on a command-line parameter, or on the 32-bit platform counter being at least somewhat likely to overflow before a softirq occurs -- it seems lots of systems are using 14MHz HPET, and that gives us a couple of minutes for the plt_overflow softirq to do its work before overflow occurs. I think we would notice that outage in other ways. :) -- Keir
Jan Beulich
2012-Nov-08 09:39 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
>>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote: > On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote: > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 >>>> tsc_stamp=e3839fcb0273 >>> >>> (below is the complete xm dmesg output) >>> >>> did that help you ? do you need more info ? >> >> I''ll leave this to Keir (who wrote the debugging patch) to answer but it >> looks to me like it should be useful! > > I''m scratching my head. plt_wrap is earlier than plt_now, which should be > impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That > seems very very improbable!Is it? My understanding was that plt_stamp64 is just a software extension to the more narrow HW counter, and hence the low plt_mask bits would always be expected to be identical. The plt_wrap < plt_now thing of course is entirely unexplainable to me too: Considering that plt_scale doesn''t change at all post- boot, apart from memory corruption I could only see an memory access ordering problem to be the reason (platform_timer_stamp and/or stime_platform_stamp changing despite platform_timer_lock being held. So maybe taking a snapshot of all three static values involved in the calculation in __read_platform_stime() between acquiring the lock and the first call to __read_platform_stime(), and printing them together with the "live" values in a second printk() after the one your original patch added could rule that out. But the box doesn''t even seem to be NUMA (of course it also doesn''t help that the log level was kept restricted - hint, hint, Philippe), not does there appear to be any S3 cycle or pCPU bring-up/-down in between... Philippe, could you clarify again what CPU model(s) this is being observed on (the long times between individual steps forward with this problem perhaps warrant repeating the basics each time, as it''s otherwise quite cumbersome to always look up old pieces of information).> I wonder whether the overflow handling should just be removed, or made > conditional on a command-line parameter, or on the 32-bit platform counter > being at least somewhat likely to overflow before a softirq occurs -- it > seems lots of systems are using 14MHz HPET, and that gives us a couple of > minutes for the plt_overflow softirq to do its work before overflow occurs. > I think we would notice that outage in other ways. :)Iirc we added this for a good reason - to cover the, however unlikely, event of Xen running for very long without preemption. Presumably most of the cases got fixed meanwhile, and indeed a wraparound time on the order of minutes should make this superfluous, but as the case here shows that code did spot a severe anomaly (whatever that may turn out to be). Also recall that there are HPET implementations around that tick at a much higher frequency than 14MHz. So unless we finally reach the understanding that the code is flawed, I would rather want to keep it. Jan
Keir Fraser
2012-Nov-08 10:38 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 08/11/2012 09:39, "Jan Beulich" <JBeulich@suse.com> wrote:>>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 >>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 >>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 >>>>> tsc_stamp=e3839fcb0273 >>>> >>>> (below is the complete xm dmesg output) >>>> >>>> did that help you ? do you need more info ? >>> >>> I''ll leave this to Keir (who wrote the debugging patch) to answer but it >>> looks to me like it should be useful! >> >> I''m scratching my head. plt_wrap is earlier than plt_now, which should be >> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That >> seems very very improbable! > > Is it? My understanding was that plt_stamp64 is just a software > extension to the more narrow HW counter, and hence the low > plt_mask bits would always be expected to be identical.No, plt_stamp is simply the HW counter time at which plt_stamp64 was last brought up to date. Hence plt_stamp64 is updated as: plt_stamp64 += (new_stamp - old_stamp) & plt_mask; Hence why seeing plt_stamp64&plt_mask == new_stamp is very unexpected! -- Keir
Ian Campbell
2012-Nov-08 11:43 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On Wed, 2012-11-07 at 17:40 +0000, Keir Fraser wrote:> On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote: > > >>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 > >>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 > >>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 > >>> tsc_stamp=e3839fcb0273 > >> > >> (below is the complete xm dmesg output) > >> > >> did that help you ? do you need more info ? > > > > I''ll leave this to Keir (who wrote the debugging patch) to answer but it > > looks to me like it should be useful! > > I''m scratching my head. plt_wrap is earlier than plt_now, which should be > impossible.impossible due to guarantees made by the h/w or by construction in Xen. There appears to be a certain amount of hardware-specificness to the issue -- so I''m wondering if maybe there are some platforms whose tsc is not as monotonically increasing as it needs to be...> plt_stamp64 oddly has low 32 bits identical to new_stamp. That > seems very very improbable!Does this code run on all cpus or just one? Is it always the same one?> I wonder whether the overflow handling should just be removed, or made > conditional on a command-line parameter, or on the 32-bit platform counter > being at least somewhat likely to overflow before a softirq occurs -- it > seems lots of systems are using 14MHz HPET, and that gives us a couple of > minutes for the plt_overflow softirq to do its work before overflow occurs. > I think we would notice that outage in other ways. :) > > -- Keir > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Keir Fraser
2012-Nov-08 12:54 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 08/11/2012 11:43, "Ian Campbell" <ijc@hellion.org.uk> wrote:>>> I''ll leave this to Keir (who wrote the debugging patch) to answer but it >>> looks to me like it should be useful! >> >> I''m scratching my head. plt_wrap is earlier than plt_now, which should be >> impossible. > > impossible due to guarantees made by the h/w or by construction in Xen.That''s a question, right? By construction in Xen.> There appears to be a certain amount of hardware-specificness to the > issue -- so I''m wondering if maybe there are some platforms whose tsc is > not as monotonically increasing as it needs to be...plt_* timestamps are not derived from TSC at all.>> plt_stamp64 oddly has low 32 bits identical to new_stamp. That >> seems very very improbable! > > Does this code run on all cpus or just one? Is it always the same one?Always cpu0. -- Keir
<Philippe.Simonet@swisscom.com>
2012-Nov-08 13:47 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
Hi Mauro, that''s a question for you :> Philippe, could you clarify again what CPU model(s) this is being observed on > (the long times between individual steps forward with this problem perhaps > warrant repeating the basics each time, as it''s otherwise quite cumbersome > to always look up old pieces of information).can you provide this information ? cat /proc/cpuinfo cat /proc/meminfo hardware information (manufacturer, model, urls, ...) Thanks, Philippe> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: Thursday, November 08, 2012 10:40 AM > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser > Cc: 599161@bugs.debian.org; mrsanna1@gmail.com; Ian Campbell; xen- > devel@lists.xen.org > Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 > minutes" bug. > > >>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote: > > On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote: > > > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 > plt_wrap=5ece12d09306 > >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 > >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 > >>>> tsc_stamp=e3839fcb0273 > >>> > >>> (below is the complete xm dmesg output) > >>> > >>> did that help you ? do you need more info ? > >> > >> I''ll leave this to Keir (who wrote the debugging patch) to answer but > >> it looks to me like it should be useful! > > > > I''m scratching my head. plt_wrap is earlier than plt_now, which should > > be impossible. plt_stamp64 oddly has low 32 bits identical to > > new_stamp. That seems very very improbable! > > Is it? My understanding was that plt_stamp64 is just a software extension to > the more narrow HW counter, and hence the low plt_mask bits would always > be expected to be identical. > > The plt_wrap < plt_now thing of course is entirely unexplainable to me too: > Considering that plt_scale doesn''t change at all post- boot, apart from > memory corruption I could only see an memory access ordering problem to > be the reason (platform_timer_stamp and/or stime_platform_stamp > changing despite platform_timer_lock being held. So maybe taking a > snapshot of all three static values involved in the calculation in > __read_platform_stime() between acquiring the lock and the first call to > __read_platform_stime(), and printing them together with the "live" values > in a second > printk() after the one your original patch added could rule that out. > > But the box doesn''t even seem to be NUMA (of course it also doesn''t help > that the log level was kept restricted - hint, hint, Philippe), not does there > appear to be any S3 cycle or pCPU bring-up/-down in between... > > Philippe, could you clarify again what CPU model(s) this is being observed on > (the long times between individual steps forward with this problem perhaps > warrant repeating the basics each time, as it''s otherwise quite cumbersome > to always look up old pieces of information). > > > I wonder whether the overflow handling should just be removed, or made > > conditional on a command-line parameter, or on the 32-bit platform > > counter being at least somewhat likely to overflow before a softirq > > occurs -- it seems lots of systems are using 14MHz HPET, and that > > gives us a couple of minutes for the plt_overflow softirq to do its work > before overflow occurs. > > I think we would notice that outage in other ways. :) > > Iirc we added this for a good reason - to cover the, however unlikely, event > of Xen running for very long without preemption. > Presumably most of the cases got fixed meanwhile, and indeed a > wraparound time on the order of minutes should make this superfluous, but > as the case here shows that code did spot a severe anomaly (whatever that > may turn out to be). > > Also recall that there are HPET implementations around that tick at a much > higher frequency than 14MHz. > > So unless we finally reach the understanding that the code is flawed, I would > rather want to keep it. > > Jan
Jan Beulich
2012-Nov-08 13:53 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
>>> On 08.11.12 at 11:38, Keir Fraser <keir@xen.org> wrote: > On 08/11/2012 09:39, "Jan Beulich" <JBeulich@suse.com> wrote: > >>>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 >>>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 >>>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 >>>>>> tsc_stamp=e3839fcb0273 >>>>> >>>>> (below is the complete xm dmesg output) >>>>> >>>>> did that help you ? do you need more info ? >>>> >>>> I''ll leave this to Keir (who wrote the debugging patch) to answer but it >>>> looks to me like it should be useful! >>> >>> I''m scratching my head. plt_wrap is earlier than plt_now, which should be >>> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That >>> seems very very improbable! >> >> Is it? My understanding was that plt_stamp64 is just a software >> extension to the more narrow HW counter, and hence the low >> plt_mask bits would always be expected to be identical. > > No, plt_stamp is simply the HW counter time at which plt_stamp64 was last > brought up to date. Hence plt_stamp64 is updated as: > plt_stamp64 += (new_stamp - old_stamp) & plt_mask;I concur: Given that what old_stamp is here was new_stamp for the last update, we should simply have stamp64 = s0 + (s1 - s0) + (s2 - s1) + ... (of course with the mask applied on each addend), which (for the low bits) is the same as just new_stamp. Jan
Keir Fraser
2012-Nov-08 14:04 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 08/11/2012 13:53, "Jan Beulich" <JBeulich@suse.com> wrote:>>> Is it? My understanding was that plt_stamp64 is just a software >>> extension to the more narrow HW counter, and hence the low >>> plt_mask bits would always be expected to be identical. >> >> No, plt_stamp is simply the HW counter time at which plt_stamp64 was last >> brought up to date. Hence plt_stamp64 is updated as: >> plt_stamp64 += (new_stamp - old_stamp) & plt_mask; > > I concurWell, no, you don''t really. You''re about to point out the flaw in my reasoning...> : Given that what old_stamp is here was new_stamp for > the last update, we should simply have > > stamp64 = s0 + (s1 - s0) + (s2 - s1) + ... > > (of course with the mask applied on each addend), which (for the > low bits) is the same as just new_stamp.Very good point. Silly me. Then the observed value of plt_stamp64 makes perfect sense. -- Keir
Keir Fraser
2012-Nov-08 14:20 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 07/11/2012 17:40, "Keir Fraser" <keir@xen.org> wrote:> On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote: > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306 >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 >>>> tsc_stamp=e3839fcb0273 >>> >>> (below is the complete xm dmesg output) >>> >>> did that help you ? do you need more info ? >> >> I''ll leave this to Keir (who wrote the debugging patch) to answer but it >> looks to me like it should be useful! > > I''m scratching my head. plt_wrap is earlier than plt_now, which should be > impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That > seems very very improbable!Jan has pointed out that the value of plt_stamp64 makes perfect sense, and will in fact always have low 32 bits identical to new_stamp. At least that is explained. So, the question is then why plt_now (== __read_platform_stime(15b800366a5)) is greater than plt_wrap (== __read_platform_stime(15c800366a5)). Perhaps the scale_delta() logic is failing for some reason, but we do use it a lot elsewhere! -- Keir> I wonder whether the overflow handling should just be removed, or made > conditional on a command-line parameter, or on the 32-bit platform counter > being at least somewhat likely to overflow before a softirq occurs -- it > seems lots of systems are using 14MHz HPET, and that gives us a couple of > minutes for the plt_overflow softirq to do its work before overflow occurs. > I think we would notice that outage in other ways. :) > > -- Keir > >
Ian Campbell
2012-Nov-08 14:28 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On Thu, 2012-11-08 at 12:54 +0000, Keir Fraser wrote:> On 08/11/2012 11:43, "Ian Campbell" <ijc@hellion.org.uk> wrote: > > >>> I''ll leave this to Keir (who wrote the debugging patch) to answer but it > >>> looks to me like it should be useful! > >> > >> I''m scratching my head. plt_wrap is earlier than plt_now, which should be > >> impossible. > > > > impossible due to guarantees made by the h/w or by construction in Xen. > > That''s a question, right?Yes, sorry.> By construction in Xen. > > > There appears to be a certain amount of hardware-specificness to the > > issue -- so I''m wondering if maybe there are some platforms whose tsc is > > not as monotonically increasing as it needs to be... > > plt_* timestamps are not derived from TSC at all.I see, rather it is derived from the platform_timesource which could be HPET, pmtimer, pit etc but in this case (according to the provided xm dmesg) appears to be a 14MHz HPET. So I guess s/tsc/HPET/ in my original thought...> >> plt_stamp64 oddly has low 32 bits identical to new_stamp. That > >> seems very very improbable! > > > > Does this code run on all cpus or just one? Is it always the same one? > > Always cpu0.So it''s not cross cpu drift then. Ian.
Ian Campbell
2012-Nov-08 14:29 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On Thu, 2012-11-08 at 13:47 +0000, Philippe.Simonet@swisscom.com wrote:> Hi Mauro, > > that''s a question for you :I think Jan was asking for information relating to the system you saw this on -- or are you working on the same systems as Mauro? Of course additional information from Mauro would be useful too in order to help spotting any patterns.> > Philippe, could you clarify again what CPU model(s) this is being observed on > > (the long times between individual steps forward with this problem perhaps > > warrant repeating the basics each time, as it''s otherwise quite cumbersome > > to always look up old pieces of information). > > can you provide this information ? > cat /proc/cpuinfo > cat /proc/meminfo > hardware information (manufacturer, model, urls, ...) > > Thanks, Philippe > > > > -----Original Message----- > > From: Jan Beulich [mailto:JBeulich@suse.com] > > Sent: Thursday, November 08, 2012 10:40 AM > > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser > > Cc: 599161@bugs.debian.org; mrsanna1@gmail.com; Ian Campbell; xen- > > devel@lists.xen.org > > Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 > > minutes" bug. > > > > >>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote: > > > On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote: > > > > > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 > > plt_wrap=5ece12d09306 > > >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 > > >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 > > >>>> tsc_stamp=e3839fcb0273 > > >>> > > >>> (below is the complete xm dmesg output) > > >>> > > >>> did that help you ? do you need more info ? > > >> > > >> I''ll leave this to Keir (who wrote the debugging patch) to answer but > > >> it looks to me like it should be useful! > > > > > > I''m scratching my head. plt_wrap is earlier than plt_now, which should > > > be impossible. plt_stamp64 oddly has low 32 bits identical to > > > new_stamp. That seems very very improbable! > > > > Is it? My understanding was that plt_stamp64 is just a software extension to > > the more narrow HW counter, and hence the low plt_mask bits would always > > be expected to be identical. > > > > The plt_wrap < plt_now thing of course is entirely unexplainable to me too: > > Considering that plt_scale doesn''t change at all post- boot, apart from > > memory corruption I could only see an memory access ordering problem to > > be the reason (platform_timer_stamp and/or stime_platform_stamp > > changing despite platform_timer_lock being held. So maybe taking a > > snapshot of all three static values involved in the calculation in > > __read_platform_stime() between acquiring the lock and the first call to > > __read_platform_stime(), and printing them together with the "live" values > > in a second > > printk() after the one your original patch added could rule that out. > > > > But the box doesn''t even seem to be NUMA (of course it also doesn''t help > > that the log level was kept restricted - hint, hint, Philippe), not does there > > appear to be any S3 cycle or pCPU bring-up/-down in between... > > > > Philippe, could you clarify again what CPU model(s) this is being observed on > > (the long times between individual steps forward with this problem perhaps > > warrant repeating the basics each time, as it''s otherwise quite cumbersome > > to always look up old pieces of information). > > > > > I wonder whether the overflow handling should just be removed, or made > > > conditional on a command-line parameter, or on the 32-bit platform > > > counter being at least somewhat likely to overflow before a softirq > > > occurs -- it seems lots of systems are using 14MHz HPET, and that > > > gives us a couple of minutes for the plt_overflow softirq to do its work > > before overflow occurs. > > > I think we would notice that outage in other ways. :) > > > > Iirc we added this for a good reason - to cover the, however unlikely, event > > of Xen running for very long without preemption. > > Presumably most of the cases got fixed meanwhile, and indeed a > > wraparound time on the order of minutes should make this superfluous, but > > as the case here shows that code did spot a severe anomaly (whatever that > > may turn out to be). > > > > Also recall that there are HPET implementations around that tick at a much > > higher frequency than 14MHz. > > > > So unless we finally reach the understanding that the code is flawed, I would > > rather want to keep it. > > > > Jan > >
Keir Fraser
2012-Nov-08 14:34 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 08/11/2012 14:28, "Ian Campbell" <ijc@hellion.org.uk> wrote:>>> There appears to be a certain amount of hardware-specificness to the >>> issue -- so I''m wondering if maybe there are some platforms whose tsc is >>> not as monotonically increasing as it needs to be... >> >> plt_* timestamps are not derived from TSC at all. > > I see, rather it is derived from the platform_timesource which could be > HPET, pmtimer, pit etc but in this case (according to the provided xm > dmesg) appears to be a 14MHz HPET. > > So I guess s/tsc/HPET/ in my original thought...In fact plt_now and plt_wrap are both derived from the same value of plt_stamp64. One is derived from it directly, and the other from plt_stamp64+plt_mask+1 (== plt_stamp64+(1<<32)). -- Keir
Mauro
2012-Nov-08 15:21 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 8 November 2012 14:47, <Philippe.Simonet@swisscom.com> wrote:> Hi Mauro, > > that''s a question for you : > >> Philippe, could you clarify again what CPU model(s) this is being observed on >> (the long times between individual steps forward with this problem perhaps >> warrant repeating the basics each time, as it''s otherwise quite cumbersome >> to always look up old pieces of information). > > can you provide this information ? > cat /proc/cpuinfo > cat /proc/meminfo > hardware information (manufacturer, model, urls, ...) >cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 8 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 9 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 10 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 11 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 12 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 13 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 14 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 15 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E7330 @ 2.40GHz stepping : 11 cpu MHz : 2400.128 cache size : 3072 KB fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm bogomips : 4800.25 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: cat /proc/meminfo MemTotal: 3127132 kB MemFree: 2697364 kB Buffers: 65816 kB Cached: 62188 kB SwapCached: 0 kB Active: 117156 kB Inactive: 46840 kB Active(anon): 28668 kB Inactive(anon): 16360 kB Active(file): 88488 kB Inactive(file): 30480 kB Unevictable: 18944 kB Mlocked: 18944 kB SwapTotal: 974840 kB SwapFree: 974840 kB Dirty: 0 kB Writeback: 0 kB AnonPages: 54936 kB Mapped: 14124 kB Shmem: 508 kB Slab: 66072 kB SReclaimable: 19732 kB SUnreclaim: 46340 kB KernelStack: 4016 kB PageTables: 4216 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 2538404 kB Committed_AS: 202468 kB VmallocTotal: 34359738367 kB VmallocUsed: 289904 kB VmallocChunk: 34359384628 kB HardwareCorrupted: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 3145728 kB DirectMap2M: 0 kB note that 3G is the mem reserved to dom0, the total amount of mem in the machine is 64G. The machine is a HP Proliant DL 580 G5, http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01154573&lang=en&cc=us&taskId=&prodSeriesId=3454575&prodTypeId=15351 Thanks to all for the work.
Tim Deegan
2012-Nov-08 16:45 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
At 09:39 +0000 on 08 Nov (1352367592), Jan Beulich wrote:> The plt_wrap < plt_now thing of course is entirely unexplainable > to me too: Considering that plt_scale doesn''t change at all post- > boot, apart from memory corruption I could only see an memory > access ordering problem to be the reason (platform_timer_stamp > and/or stime_platform_stamp changing despite platform_timer_lock > being held. So maybe taking a snapshot of all three static values > involved in the calculation in __read_platform_stime() between > acquiring the lock and the first call to __read_platform_stime(), > and printing them together with the "live" values in a second > printk() after the one your original patch added could rule that > out. > > But the box doesn''t even seem to be NUMA (of course it also > doesn''t help that the log level was kept restricted - hint, hint, > Philippe), not does there appear to be any S3 cycle or pCPU > bring-up/-down in between...S3 looks like it might be a culprit, since resume_platform_timer() clobbers plt_stamp64 without taking the platform_timer_lock. But both the S3 resume code and the plt_overflow timer should only ever run on CPU 0, so even that should be safe (unless continue_hypercall_on_cpu() is broken...) Definitely having loglvl=all would have helped here, to eliminate S3 from our enquiries.> > I wonder whether the overflow handling should just be removed, or made > > conditional on a command-line parameter, or on the 32-bit platform counter > > being at least somewhat likely to overflow before a softirq occurs -- it > > seems lots of systems are using 14MHz HPET, and that gives us a couple of > > minutes for the plt_overflow softirq to do its work before overflow occurs. > > I think we would notice that outage in other ways. :) > > Iirc we added this for a good reason - to cover the, however > unlikely, event of Xen running for very long without preemption. > Presumably most of the cases got fixed meanwhile, and indeed > a wraparound time on the order of minutes should make this > superfluous, but as the case here shows that code did spot a > severe anomaly (whatever that may turn out to be).ISTR when this code went in we were dealing with a timer that had a period of about 4 seconds (ACPI PMTIMER?). It might well be OTT for the HPET, but if there''s something weird going on I''d like to track it down while we have some sort of a handle on it. Tim.
Keir Fraser
2012-Nov-08 16:53 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 08/11/2012 16:45, "Tim Deegan" <tim@xen.org> wrote:>>> I wonder whether the overflow handling should just be removed, or made >>> conditional on a command-line parameter, or on the 32-bit platform counter >>> being at least somewhat likely to overflow before a softirq occurs -- it >>> seems lots of systems are using 14MHz HPET, and that gives us a couple of >>> minutes for the plt_overflow softirq to do its work before overflow occurs. >>> I think we would notice that outage in other ways. :) >> >> Iirc we added this for a good reason - to cover the, however >> unlikely, event of Xen running for very long without preemption. >> Presumably most of the cases got fixed meanwhile, and indeed >> a wraparound time on the order of minutes should make this >> superfluous, but as the case here shows that code did spot a >> severe anomaly (whatever that may turn out to be). > > ISTR when this code went in we were dealing with a timer that had a > period of about 4 seconds (ACPI PMTIMER?). It might well be OTT for the > HPET, but if there''s something weird going on I''d like to track it down > while we have some sort of a handle on it.It must have been the PMTIMER. It''s the only counter narrower than 32 bits (legacy PIT we simulate as a 32-bit counter behind the scenes). -- Keir
<Philippe.Simonet@swisscom.com>
2012-Nov-09 09:05 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
> -----Original Message----- > From: Ian Campbell [mailto:ijc@hellion.org.uk] > Sent: Thursday, November 08, 2012 3:29 PM > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE > Cc: mrsanna1@gmail.com; 599161@bugs.debian.org; xen- > devel@lists.xen.org; keir@xen.org; JBeulich@suse.com > Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 minutes" bug. > > > I think Jan was asking for information relating to the system you saw this on - > - or are you working on the same systems as Mauro?oops, excuse me, here is a description : I have the problem on 4 systems, all with same hardware. the problem occured on each system, 1 time each 2 month in average. since January 2012, I decided to reboot them all monthly, and the clock jump occurred only once in February ... SYSTEM : HP ProLiant DL385 G7, with 2 * AMD Processor 6174 (12 cores) = 24 cores, 16 GB MEMORY XEN (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) (gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep 8 19:15:46 UTC 2012 DOM0 Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23 13:49:30 UTC 2012 x86_64 GNU/Linux CPU processor : 0 vendor_id : AuthenticAMD cpu family : 16 model : 9 model name : AMD Opteron(tm) Processor 6174 stepping : 1 cpu MHz : 3791872.477 cache size : 512 KB fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu de tsc msr pae mce cx8 apic mtrr mca cmov pat clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm pni cx16 popcnt hypervisor lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch nodeid_msr bogomips : 4400.17 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 48 bits physical, 48 bits virtual power management: ts ttp tm stc 100mhzsteps hwpstate> > Of course additional information from Mauro would be useful too in order to > help spotting any patterns. > > > > Philippe, could you clarify again what CPU model(s) this is being > > > observed on (the long times between individual steps forward with > > > this problem perhaps warrant repeating the basics each time, as it''s > > > otherwise quite cumbersome to always look up old pieces of > information). > > > > can you provide this information ? > > cat /proc/cpuinfo > > cat /proc/meminfo > > hardware information (manufacturer, model, urls, ...) > > > > Thanks, Philippe > > > > > > > -----Original Message----- > > > From: Jan Beulich [mailto:JBeulich@suse.com] > > > Sent: Thursday, November 08, 2012 10:40 AM > > > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser > > > Cc: 599161@bugs.debian.org; mrsanna1@gmail.com; Ian Campbell; xen- > > > devel@lists.xen.org > > > Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock > > > shifts by 50 minutes" bug. > > > > > > >>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote: > > > > On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote: > > > > > > > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 > > > plt_wrap=5ece12d09306 > > > >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5 > > > >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854 > > > >>>> tsc_stamp=e3839fcb0273 > > > >>> > > > >>> (below is the complete xm dmesg output) > > > >>> > > > >>> did that help you ? do you need more info ? > > > >> > > > >> I''ll leave this to Keir (who wrote the debugging patch) to answer > > > >> but it looks to me like it should be useful! > > > > > > > > I''m scratching my head. plt_wrap is earlier than plt_now, which > > > > should be impossible. plt_stamp64 oddly has low 32 bits identical > > > > to new_stamp. That seems very very improbable! > > > > > > Is it? My understanding was that plt_stamp64 is just a software > > > extension to the more narrow HW counter, and hence the low plt_mask > > > bits would always be expected to be identical. > > > > > > The plt_wrap < plt_now thing of course is entirely unexplainable to me > too: > > > Considering that plt_scale doesn''t change at all post- boot, apart > > > from memory corruption I could only see an memory access ordering > > > problem to be the reason (platform_timer_stamp and/or > > > stime_platform_stamp changing despite platform_timer_lock being > > > held. So maybe taking a snapshot of all three static values involved > > > in the calculation in > > > __read_platform_stime() between acquiring the lock and the first > > > call to __read_platform_stime(), and printing them together with the > > > "live" values in a second > > > printk() after the one your original patch added could rule that out. > > > > > > But the box doesn''t even seem to be NUMA (of course it also doesn''t > > > help that the log level was kept restricted - hint, hint, Philippe), > > > not does there appear to be any S3 cycle or pCPU bring-up/-down in > between... > > > > > > Philippe, could you clarify again what CPU model(s) this is being > > > observed on (the long times between individual steps forward with > > > this problem perhaps warrant repeating the basics each time, as it''s > > > otherwise quite cumbersome to always look up old pieces of > information). > > > > > > > I wonder whether the overflow handling should just be removed, or > > > > made conditional on a command-line parameter, or on the 32-bit > > > > platform counter being at least somewhat likely to overflow before > > > > a softirq occurs -- it seems lots of systems are using 14MHz HPET, > > > > and that gives us a couple of minutes for the plt_overflow softirq > > > > to do its work > > > before overflow occurs. > > > > I think we would notice that outage in other ways. :) > > > > > > Iirc we added this for a good reason - to cover the, however > > > unlikely, event of Xen running for very long without preemption. > > > Presumably most of the cases got fixed meanwhile, and indeed a > > > wraparound time on the order of minutes should make this > > > superfluous, but as the case here shows that code did spot a severe > > > anomaly (whatever that may turn out to be). > > > > > > Also recall that there are HPET implementations around that tick at > > > a much higher frequency than 14MHz. > > > > > > So unless we finally reach the understanding that the code is > > > flawed, I would rather want to keep it. > > > > > > Jan > > > > >
Jan Beulich
2012-Nov-09 09:47 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
>>> On 09.11.12 at 10:05, <Philippe.Simonet@swisscom.com> wrote: > oops, excuse me, here is a description : I have the problem on 4 systems, > all with same hardware. > the problem occured on each system, 1 time each 2 month in average. since > January 2012, I decided to reboot them all monthly, > and the clock jump occurred only once in February ... > > SYSTEM : HP ProLiant DL385 G7, with 2 * AMD Processor 6174 (12 cores) = 24 > cores, 16 GB MEMORY > XEN (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) > (gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep 8 19:15:46 UTC 2012 > DOM0 Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23 13:49:30 UTC 2012 x86_64 > GNU/Linux > CPU > processor : 0 > vendor_id : AuthenticAMD > cpu family : 16 > model : 9 > model name : AMD Opteron(tm) Processor 6174Huh - so we have the problem on even different vendor CPUs (as Mauro''s are Intel ones). But I take it that you haven''t see an event yet with the debugging patch? Plus, what''s puzzling me a little too - before the occurrence of the event on Mauro''s system, I was under the impression that this requires quite a bit of uptime. Yet the event he observed occurred early on the second day after boot afaict. Jan
Mauro
2012-Nov-09 09:54 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 9 November 2012 10:47, Jan Beulich <JBeulich@suse.com> wrote:>>>> On 09.11.12 at 10:05, <Philippe.Simonet@swisscom.com> wrote: >> oops, excuse me, here is a description : I have the problem on 4 systems, >> all with same hardware. >> the problem occured on each system, 1 time each 2 month in average. since >> January 2012, I decided to reboot them all monthly, >> and the clock jump occurred only once in February ... >> >> SYSTEM : HP ProLiant DL385 G7, with 2 * AMD Processor 6174 (12 cores) = 24 >> cores, 16 GB MEMORY >> XEN (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) >> (gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep 8 19:15:46 UTC 2012 >> DOM0 Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23 13:49:30 UTC 2012 x86_64 >> GNU/Linux >> CPU >> processor : 0 >> vendor_id : AuthenticAMD >> cpu family : 16 >> model : 9 >> model name : AMD Opteron(tm) Processor 6174 > > Huh - so we have the problem on even different vendor CPUs (as > Mauro''s are Intel ones). But I take it that you haven''t see an event > yet with the debugging patch? > > Plus, what''s puzzling me a little too - before the occurrence of the > event on Mauro''s system, I was under the impression that this > requires quite a bit of uptime. Yet the event he observed occurred > early on the second day after boot afaict.Before the patch clock jumps on my systems occurred about once or twice at week.
Jan Beulich
2012-Nov-13 10:14 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
>>> On 09.11.12 at 10:05, <Philippe.Simonet@swisscom.com> wrote:Since it looks like this got stalled again, attached is a slightly extended version of Keir''s debugging patch, allowing to rule out any inconsistencies of the globals between the first and second instances of the two invocations of __read_platform_stime(). Should the numbers printed turn out identical between the two invocations and identical to the boot time determined values, then I''m afraid I''m out of explanations as well as debugging suggestions. Please remember to add "loglvl=all" to the hypervisor command line. The patch is against a 4.0.3 based tree I had still lying around, so I hope it''ll apply cleanly to your 4.0.1 based one. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Jan Beulich
2012-Nov-26 10:20 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote: > (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3 > now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce > plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82 > tsc_stamp=cfdf69ccc91d > (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3 > (XEN) XXX ps=7:8baebcb8/7:8baebcb8I.e. plt_wrap == stime_platform_stamp, which would imply plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which visibly isn''t the case. Having to consider mis-compilation of some sort at this point, can you please attach the xen-syms matching the xen.gz from the run above? The only other alternative would appear to be register or stack corruption (which of the two might also be visible from the xen-syms, so it would be needed for that determination too). As the code in question runs with interrupts disabled, the only sources for that would be NMI, MCE, or SMI. Adding logging of instances of the former two would be possible, but otoh the entry/exit code of them is uniform with other exceptions and hence can pretty much be excluded to corrupt anything. Jan
Mauro
2012-Nov-26 10:28 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote:>>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote: >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3 >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce >> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82 >> tsc_stamp=cfdf69ccc91d >> (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3 >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8 > > I.e. plt_wrap == stime_platform_stamp, which would imply > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which > visibly isn''t the case. > > Having to consider mis-compilation of some sort at this point, > can you please attach the xen-syms matching the xen.gz from > the run above?how can I do?
Ian Campbell
2012-Nov-26 10:40 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
On Mon, 2012-11-26 at 11:28 +0100, Mauro wrote:> On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote: > >>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote: > >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3 > >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce > >> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82 > >> tsc_stamp=cfdf69ccc91d > >> (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3 > >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8 > > > > I.e. plt_wrap == stime_platform_stamp, which would imply > > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which > > visibly isn''t the case. > > > > Having to consider mis-compilation of some sort at this point, > > can you please attach the xen-syms matching the xen.gz from > > the run above? > > how can I do?xen/xen-syms should be in the place where you built the hypervisor. -- Ian Campbell It is not doing the thing we like to do, but liking the thing we have to do, that makes life blessed. -- Goethe
Jan Beulich
2012-Nov-26 12:18 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
>>> On 26.11.12 at 11:40, Ian Campbell <ijc@hellion.org.uk> wrote: > On Mon, 2012-11-26 at 11:28 +0100, Mauro wrote: >> On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote: >> >>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote: >> >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3 >> >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce >> >> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82 >> >> tsc_stamp=cfdf69ccc91d >> >> (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3 >> >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8 >> > >> > I.e. plt_wrap == stime_platform_stamp, which would imply >> > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which >> > visibly isn''t the case. >> > >> > Having to consider mis-compilation of some sort at this point, >> > can you please attach the xen-syms matching the xen.gz from >> > the run above? >> >> how can I do? > > xen/xen-syms should be in the place where you built the hypervisor.... or, if you didn''t build the hypervisor yourself, you should check the package you were handed. Jan
<Philippe.Simonet@swisscom.com>
2012-Nov-26 13:08 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
Hi jan I''ve done that for Mauro, here are the symbols : http://www.ncube.ch/xen-syms.gz MANY thanks and regards Philippe> -----Original Message----- > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: Monday, November 26, 2012 1:18 PM > To: Mauro > Cc: Keir Fraser; Ian Campbell; xen-devel; Simonet Philippe, ITS-OUS-OP- > IFM-NW-IPE > Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 > minutes" bug. > > >>> On 26.11.12 at 11:40, Ian Campbell <ijc@hellion.org.uk> wrote: > > On Mon, 2012-11-26 at 11:28 +0100, Mauro wrote: > >> On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote: > >> >>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote: > >> >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 > plt_wrap=569ed0bd2ac3 > >> >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce > >> >> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82 > >> >> tsc_stamp=cfdf69ccc91d > >> >> (XEN) XXX pts=13d80036988/13d80036988 > >> >> sps=569ed0bd2ac3/569ed0bd2ac3 > >> >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8 > >> > > >> > I.e. plt_wrap == stime_platform_stamp, which would imply > >> > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which visibly > >> > isn''t the case. > >> > > >> > Having to consider mis-compilation of some sort at this point, can > >> > you please attach the xen-syms matching the xen.gz from the run > >> > above? > >> > >> how can I do? > > > > xen/xen-syms should be in the place where you built the hypervisor. > > ... or, if you didn''t build the hypervisor yourself, you should check the > package you were handed. > > Jan
Sylvain Munaut
2012-Nov-26 14:15 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
Hi, I just wanted to say I''ve also been hitting this problem for years on some servers. That was originally under xen 4.0 when I originally installed the machines, but even after update to 4.1.3 and more recently a 3.6.4 dom0 kernel, the problem is still present. More specifically the two hw platforms I''m sure are concerned (hw + cpu info + debug-key ''s'' info) Sun Fire x2250: cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU L5420 @ 2.50GHz stepping : 6 microcode : 0x60b (XEN) TSC has constant rate, deep Cstates possible, so not reliable, warp=3292 (count=1) (XEN) dom1: mode=0,ofs=0x56eb699774,khz=2500128,inc=1,vtsc count: 1168090 kernel, 18678 user Sun Fire x4150: cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz stepping : 10 microcode : 0xa07 (XEN) TSC has constant rate, deep Cstates possible, so not reliable, warp=3789 (count=1) (XEN) dom2: mode=0,ofs=0x3a652aed7d,khz=2992564,inc=1,vtsc count: 4247396065 kernel, 712543 user (XEN) dom3: mode=0,ofs=0x9eb4c44593,khz=2500147,inc=2,vtsc count: 1967559486 kernel, 192516 user When the problem appears, I can also see the "(XEN) Platform timer appears to have unexpectedly wrapped 10 or more times." message. I have now deployed a freshly compiled 4.1.3 with the latest tsc debug patched compiled and loglvl=all option to a test server with the same hw where the problem happenned and I''ll report if I see anything. The problem is pretty rare so not sure when it will happen next ... Cheers, Sylvain
Jan Beulich
2012-Nov-26 14:29 UTC
Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
>>> On 26.11.12 at 14:08, <Philippe.Simonet@swisscom.com> wrote: > I''ve done that for Mauro, here are the symbols : > http://www.ncube.ch/xen-syms.gzThat helped; I''ll send a patch shortly. Jan