thr3ads.net - Xen devel - #599161: Xen debug patch for the "clock shifts by 50 minutes" bug. [Oct 2012]

If this information is useful, please help other people find it:
Share via:

Ian Campbell

2012-Oct-26 12:59 UTC

#599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Hi all,

I''ve BCC''d a number of people who have reported seeing this
bug at
various times in the past.

If you can still repro I''d appreciate it if you could give the patch in
http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached) a go
and report back success/failure and the output of the debugging messages
produced.

Thanks,
Ian.

-- 
Ian Campbell
Current Noise: Death - Evil Dead

Executive ability is prominent in your make-up.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Mauro

2012-Oct-26 18:25 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 26 October 2012 14:59, Ian Campbell <ijc@hellion.org.uk>
wrote:> Hi all,
>
> I''ve BCC''d a number of people who have reported seeing
this bug at
> various times in the past.
>
> If you can still repro I''d appreciate it if you could give the
patch in
> http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached)
a go
> and report back success/failure and the output of the debugging messages
> produced.
Is that patch for amd64 architectures?

Ian Campbell

2012-Oct-26 18:40 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On Fri, 2012-10-26 at 19:25 +0100, Mauro wrote:> On 26 October 2012 14:59, Ian Campbell <ijc@hellion.org.uk> wrote:
> > Hi all,
> >
> > I''ve BCC''d a number of people who have reported
seeing this bug at
> > various times in the past.
> >
> > If you can still repro I''d appreciate it if you could give
the patch in
> > http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also
attached) a go
> > and report back success/failure and the output of the debugging
messages
> > produced.
> 
> Is that patch for amd64 architectures?
It is for 32 and 64 bit x86, so yes.

BTW, you were the original recipient of this patch in the thread linked
above.

Ian.
-- 
Ian Campbell


"I''d love to go out with you, but I''m having all my
plants neutered."

<Philippe.Simonet@swisscom.com>

2012-Nov-07 10:10 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Hi Ian

i compiled  a patched hypervisor for Mauro, it is running since many days and
the overflow occured,
without clock jumps
> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> tsc_stamp=e3839fcb0273
(below is the complete xm dmesg output)

did that help you ? do you need more info ? 

thanks and regards

Philippe

> -----Original Message-----
> From: Mauro [mailto:mrsanna1@gmail.com]
> Sent: Wednesday, November 07, 2012 10:12 AM
> To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE
> Subject: Re: [Xen-devel] [Xen-users] Re: Xen 4 TSC problems
> 
> Hello, no news until now there aren''t clock jumps.
> Here is xm dmesg:
> 
> xm dmesg
> (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) (gcc
> version 4.4.5 (Debian 4.4.5-8) ) Mon Oct 29 14:42:12 CET 2012
> (XEN) Bootloader: GRUB 1.98+20100804-14+squeeze1
> (XEN) Command line: placeholder dom0_mem=3072M loglvl=warning
> guest_loglvl=warning
> (XEN) Video information:
> (XEN)  VGA is text mode 80x25, font 8x16
> (XEN)  VBE/DDC methods: V2; EDID transfer time: 2 seconds
> (XEN) Disc information:
> (XEN)  Found 2 MBR signatures
> (XEN)  Found 2 EDD information structures
> (XEN) Xen-e820 RAM map:
> (XEN)  0000000000000000 - 000000000009f400 (usable)
> (XEN)  000000000009f400 - 00000000000a0000 (reserved)
> (XEN)  00000000000f0000 - 0000000000100000 (reserved)
> (XEN)  0000000000100000 - 00000000cfd43000 (usable)
> (XEN)  00000000cfd43000 - 00000000cfd4c000 (ACPI data)
> (XEN)  00000000cfd4c000 - 00000000cfd4d000 (usable)
> (XEN)  00000000cfd4d000 - 00000000d0000000 (reserved)
> (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
> (XEN)  00000000fec00000 - 00000000fed00000 (reserved)
> (XEN)  00000000fee00000 - 00000000fee10000 (reserved)
> (XEN)  00000000ffc00000 - 0000000100000000 (reserved)
> (XEN)  0000000100000000 - 000000102ffff000 (usable)
> (XEN) ACPI: RSDP 000F4F20, 0024 (r2 HP    )
> (XEN) ACPI: XSDT CFD43900, 007C (r1 HP     ProLiant        2         162E)
> (XEN) ACPI: FACP CFD439C0, 00F4 (r3 HP     ProLiant        2         162E)
> (XEN) ACPI: DSDT CFD43AC0, 30C9 (r1 HP         DSDT        1 INTL 20030228)
> (XEN) ACPI: FACS CFD43100, 0040
> (XEN) ACPI: SPCR CFD43140, 0050 (r1 HP     SPCRRBSU        1         162E)
> (XEN) ACPI: MCFG CFD431C0, 003C (r1 HP     ProLiant        1             0)
> (XEN) ACPI: HPET CFD43200, 0038 (r1 HP     ProLiant        2         162E)
> (XEN) ACPI: FFFF CFD43240, 0064 (r2 HP     P61             2         162E)
> (XEN) ACPI: SPMI CFD432C0, 0040 (r5 HP     ProLiant        1         162E)
> (XEN) ACPI: ERST CFD43300, 01D0 (r1 HP     ProLiant        1         162E)
> (XEN) ACPI: APIC CFD43500, 0176 (r1 HP     ProLiant        2             0)
> (XEN) ACPI: FFFF CFD43680, 0176 (r1 HP     ProLiant        1         162E)
> (XEN) ACPI: BERT CFD43800, 0030 (r1 HP     ProLiant        1         162E)
> (XEN) ACPI: HEST CFD43840, 00BC (r1 HP     ProLiant        1         162E)
> (XEN) System RAM: 65532MB (67105672kB)
> (XEN) Domain heap initialised
> (XEN) Processor #0 6:15 APIC version 20
> (XEN) Processor #8 6:15 APIC version 20
> (XEN) Processor #16 6:15 APIC version 20
> (XEN) Processor #24 6:15 APIC version 20
> (XEN) Processor #1 6:15 APIC version 20
> (XEN) Processor #9 6:15 APIC version 20
> (XEN) Processor #17 6:15 APIC version 20
> (XEN) Processor #25 6:15 APIC version 20
> (XEN) Processor #2 6:15 APIC version 20
> (XEN) Processor #10 6:15 APIC version 20
> (XEN) Processor #18 6:15 APIC version 20
> (XEN) Processor #26 6:15 APIC version 20
> (XEN) Processor #3 6:15 APIC version 20
> (XEN) Processor #11 6:15 APIC version 20
> (XEN) Processor #19 6:15 APIC version 20
> (XEN) Processor #27 6:15 APIC version 20
> (XEN) IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
> (XEN) IOAPIC[1]: apic_id 2, version 32, address 0xfec80000, GSI 24-47
> (XEN) IOAPIC[2]: apic_id 3, version 32, address 0xfec81000, GSI 48-71
> (XEN) IOAPIC[3]: apic_id 4, version 32, address 0xfec81800, GSI 72-95
> (XEN) Enabling APIC mode:  Phys.  Using 4 I/O APICs
> (XEN) Using scheduler: SMP Credit Scheduler (credit)
> (XEN) Detected 2400.128 MHz processor.
> (XEN) Initing memory sharing.
> (XEN) VMX: Supported advanced features:
> (XEN)  - APIC MMIO access virtualisation
> (XEN)  - APIC TPR shadow
> (XEN)  - Virtual NMI
> (XEN)  - MSR direct-access bitmap
> (XEN) HVM: ASIDs disabled.
> (XEN) HVM: VMX enabled
> (XEN) I/O virtualisation disabled
> (XEN) Total of 16 processors activated.
> (XEN) ENABLING IO-APIC IRQs
> (XEN)  -> Using new ACK method
> (XEN) checking TSC synchronization across 16 CPUs:
> (XEN) CPU#14 had 3 usecs TSC skew, fixed it up.
> (XEN) Platform timer is 14.318MHz HPET
> (XEN) Allocated console ring of 32 KiB.
> (XEN) Brought up 16 CPUs
> (XEN) *** LOADING DOMAIN 0 ***
> (XEN)  Xen  kernel: 64-bit, lsb, compat32
> (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1708000
> (XEN) PHYSICAL MEMORY ARRANGEMENT:
> (XEN)  Dom0 alloc.:   000000083c000000->0000000840000000 (770048 pages
> to be allocated)
> (XEN) VIRTUAL MEMORY ARRANGEMENT:
> (XEN)  Loaded kernel: ffffffff81000000->ffffffff81708000
> (XEN)  Init. ramdisk: ffffffff81708000->ffffffff81efb000
> (XEN)  Phys-Mach map: ffffffff81efb000->ffffffff824fb000
> (XEN)  Start info:    ffffffff824fb000->ffffffff824fb4b4
> (XEN)  Page tables:   ffffffff824fc000->ffffffff82513000
> (XEN)  Boot stack:    ffffffff82513000->ffffffff82514000
> (XEN)  TOTAL:         ffffffff80000000->ffffffff82800000
> (XEN)  ENTRY ADDRESS: ffffffff81531200
> (XEN) Dom0 has maximum 16 VCPUs
> (XEN) Scrubbing Free RAM:
>
.....................................................................................................................
>
.....................................................................................................................
>
.....................................................................................................................
>
.....................................................................................................................
>
.....................................................................................................................
> ................................done.
> (XEN) Xen trace buffers: disabled
> (XEN) Std. Loglevel: Errors and warnings
> (XEN) Guest Loglevel: Errors and warnings
> (XEN) Xen is relinquishing VGA console.
> (XEN) *** Serial input -> DOM0 (type ''CTRL-a'' three
times to switch input to
> Xen)
> (XEN) Freed 176kB init memory.
> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> tsc_stamp=e3839fcb0273
> 
> 



> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Ian Campbell
> Sent: Friday, October 26, 2012 3:00 PM
> To: xen-devel@lists.xen.org
> Cc: 599161@bugs.debian.org
> Subject: [Xen-devel] #599161: Xen debug patch for the "clock shifts by
50
> minutes" bug.
> 
> Hi all,
> 
> I''ve BCC''d a number of people who have reported seeing
this bug at various
> times in the past.
> 
> If you can still repro I''d appreciate it if you could give the
patch in
> http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also attached)
a
> go and report back success/failure and the output of the debugging
> messages produced.
> 
> Thanks,
> Ian.
> 
> --
> Ian Campbell
> Current Noise: Death - Evil Dead
> 
> Executive ability is prominent in your make-up.

Ian Campbell

2012-Nov-07 13:22 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On Wed, 2012-11-07 at 10:10 +0000, Philippe.Simonet@swisscom.com
wrote:> Hi Ian
Thanks for doing this test.
> i compiled  a patched hypervisor for Mauro, it is running since many
> days and the overflow occured, without clock jumps
So just to be clear you saw this logging occur *without* the 50 minute
jump in time? That''s good!
> > (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> > now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> > plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> > tsc_stamp=e3839fcb0273
> 
> (below is the complete xm dmesg output)
> 
> did that help you ? do you need more info ? 
I''ll leave this to Keir (who wrote the debugging patch) to answer but
it
looks to me like it should be useful!

Thanks again.

Ian.
> thanks and regards
> 
> Philippe
> 
> 
> > -----Original Message-----
> > From: Mauro [mailto:mrsanna1@gmail.com]
> > Sent: Wednesday, November 07, 2012 10:12 AM
> > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE
> > Subject: Re: [Xen-devel] [Xen-users] Re: Xen 4 TSC problems
> > 
> > Hello, no news until now there aren''t clock jumps.
> > Here is xm dmesg:
> > 
> > xm dmesg
> > (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org) (gcc
> > version 4.4.5 (Debian 4.4.5-8) ) Mon Oct 29 14:42:12 CET 2012
> > (XEN) Bootloader: GRUB 1.98+20100804-14+squeeze1
> > (XEN) Command line: placeholder dom0_mem=3072M loglvl=warning
> > guest_loglvl=warning
> > (XEN) Video information:
> > (XEN)  VGA is text mode 80x25, font 8x16
> > (XEN)  VBE/DDC methods: V2; EDID transfer time: 2 seconds
> > (XEN) Disc information:
> > (XEN)  Found 2 MBR signatures
> > (XEN)  Found 2 EDD information structures
> > (XEN) Xen-e820 RAM map:
> > (XEN)  0000000000000000 - 000000000009f400 (usable)
> > (XEN)  000000000009f400 - 00000000000a0000 (reserved)
> > (XEN)  00000000000f0000 - 0000000000100000 (reserved)
> > (XEN)  0000000000100000 - 00000000cfd43000 (usable)
> > (XEN)  00000000cfd43000 - 00000000cfd4c000 (ACPI data)
> > (XEN)  00000000cfd4c000 - 00000000cfd4d000 (usable)
> > (XEN)  00000000cfd4d000 - 00000000d0000000 (reserved)
> > (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
> > (XEN)  00000000fec00000 - 00000000fed00000 (reserved)
> > (XEN)  00000000fee00000 - 00000000fee10000 (reserved)
> > (XEN)  00000000ffc00000 - 0000000100000000 (reserved)
> > (XEN)  0000000100000000 - 000000102ffff000 (usable)
> > (XEN) ACPI: RSDP 000F4F20, 0024 (r2 HP    )
> > (XEN) ACPI: XSDT CFD43900, 007C (r1 HP     ProLiant        2        
162E)
> > (XEN) ACPI: FACP CFD439C0, 00F4 (r3 HP     ProLiant        2        
162E)
> > (XEN) ACPI: DSDT CFD43AC0, 30C9 (r1 HP         DSDT        1 INTL
20030228)
> > (XEN) ACPI: FACS CFD43100, 0040
> > (XEN) ACPI: SPCR CFD43140, 0050 (r1 HP     SPCRRBSU        1        
162E)
> > (XEN) ACPI: MCFG CFD431C0, 003C (r1 HP     ProLiant        1          
0)
> > (XEN) ACPI: HPET CFD43200, 0038 (r1 HP     ProLiant        2        
162E)
> > (XEN) ACPI: FFFF CFD43240, 0064 (r2 HP     P61             2        
162E)
> > (XEN) ACPI: SPMI CFD432C0, 0040 (r5 HP     ProLiant        1        
162E)
> > (XEN) ACPI: ERST CFD43300, 01D0 (r1 HP     ProLiant        1        
162E)
> > (XEN) ACPI: APIC CFD43500, 0176 (r1 HP     ProLiant        2          
0)
> > (XEN) ACPI: FFFF CFD43680, 0176 (r1 HP     ProLiant        1        
162E)
> > (XEN) ACPI: BERT CFD43800, 0030 (r1 HP     ProLiant        1        
162E)
> > (XEN) ACPI: HEST CFD43840, 00BC (r1 HP     ProLiant        1        
162E)
> > (XEN) System RAM: 65532MB (67105672kB)
> > (XEN) Domain heap initialised
> > (XEN) Processor #0 6:15 APIC version 20
> > (XEN) Processor #8 6:15 APIC version 20
> > (XEN) Processor #16 6:15 APIC version 20
> > (XEN) Processor #24 6:15 APIC version 20
> > (XEN) Processor #1 6:15 APIC version 20
> > (XEN) Processor #9 6:15 APIC version 20
> > (XEN) Processor #17 6:15 APIC version 20
> > (XEN) Processor #25 6:15 APIC version 20
> > (XEN) Processor #2 6:15 APIC version 20
> > (XEN) Processor #10 6:15 APIC version 20
> > (XEN) Processor #18 6:15 APIC version 20
> > (XEN) Processor #26 6:15 APIC version 20
> > (XEN) Processor #3 6:15 APIC version 20
> > (XEN) Processor #11 6:15 APIC version 20
> > (XEN) Processor #19 6:15 APIC version 20
> > (XEN) Processor #27 6:15 APIC version 20
> > (XEN) IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
> > (XEN) IOAPIC[1]: apic_id 2, version 32, address 0xfec80000, GSI 24-47
> > (XEN) IOAPIC[2]: apic_id 3, version 32, address 0xfec81000, GSI 48-71
> > (XEN) IOAPIC[3]: apic_id 4, version 32, address 0xfec81800, GSI 72-95
> > (XEN) Enabling APIC mode:  Phys.  Using 4 I/O APICs
> > (XEN) Using scheduler: SMP Credit Scheduler (credit)
> > (XEN) Detected 2400.128 MHz processor.
> > (XEN) Initing memory sharing.
> > (XEN) VMX: Supported advanced features:
> > (XEN)  - APIC MMIO access virtualisation
> > (XEN)  - APIC TPR shadow
> > (XEN)  - Virtual NMI
> > (XEN)  - MSR direct-access bitmap
> > (XEN) HVM: ASIDs disabled.
> > (XEN) HVM: VMX enabled
> > (XEN) I/O virtualisation disabled
> > (XEN) Total of 16 processors activated.
> > (XEN) ENABLING IO-APIC IRQs
> > (XEN)  -> Using new ACK method
> > (XEN) checking TSC synchronization across 16 CPUs:
> > (XEN) CPU#14 had 3 usecs TSC skew, fixed it up.
> > (XEN) Platform timer is 14.318MHz HPET
> > (XEN) Allocated console ring of 32 KiB.
> > (XEN) Brought up 16 CPUs
> > (XEN) *** LOADING DOMAIN 0 ***
> > (XEN)  Xen  kernel: 64-bit, lsb, compat32
> > (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1708000
> > (XEN) PHYSICAL MEMORY ARRANGEMENT:
> > (XEN)  Dom0 alloc.:   000000083c000000->0000000840000000 (770048
pages
> > to be allocated)
> > (XEN) VIRTUAL MEMORY ARRANGEMENT:
> > (XEN)  Loaded kernel: ffffffff81000000->ffffffff81708000
> > (XEN)  Init. ramdisk: ffffffff81708000->ffffffff81efb000
> > (XEN)  Phys-Mach map: ffffffff81efb000->ffffffff824fb000
> > (XEN)  Start info:    ffffffff824fb000->ffffffff824fb4b4
> > (XEN)  Page tables:   ffffffff824fc000->ffffffff82513000
> > (XEN)  Boot stack:    ffffffff82513000->ffffffff82514000
> > (XEN)  TOTAL:         ffffffff80000000->ffffffff82800000
> > (XEN)  ENTRY ADDRESS: ffffffff81531200
> > (XEN) Dom0 has maximum 16 VCPUs
> > (XEN) Scrubbing Free RAM:
> >
.....................................................................................................................
> >
.....................................................................................................................
> >
.....................................................................................................................
> >
.....................................................................................................................
> >
.....................................................................................................................
> > ................................done.
> > (XEN) Xen trace buffers: disabled
> > (XEN) Std. Loglevel: Errors and warnings
> > (XEN) Guest Loglevel: Errors and warnings
> > (XEN) Xen is relinquishing VGA console.
> > (XEN) *** Serial input -> DOM0 (type ''CTRL-a''
three times to switch input to
> > Xen)
> > (XEN) Freed 176kB init memory.
> > (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> > now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> > plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> > tsc_stamp=e3839fcb0273
> > 
> > 
> 
> 
> 
> 
> 
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> > bounces@lists.xen.org] On Behalf Of Ian Campbell
> > Sent: Friday, October 26, 2012 3:00 PM
> > To: xen-devel@lists.xen.org
> > Cc: 599161@bugs.debian.org
> > Subject: [Xen-devel] #599161: Xen debug patch for the "clock
shifts by 50
> > minutes" bug.
> > 
> > Hi all,
> > 
> > I''ve BCC''d a number of people who have reported
seeing this bug at various
> > times in the past.
> > 
> > If you can still repro I''d appreciate it if you could give
the patch in
> > http://marc.info/?l=xen-devel&m=135049062216685&w=2 (also
attached) a
> > go and report back success/failure and the output of the debugging
> > messages produced.
> > 
> > Thanks,
> > Ian.
> > 
> > --
> > Ian Campbell
> > Current Noise: Death - Evil Dead
> > 
> > Executive ability is prominent in your make-up.

Jan Beulich

2012-Nov-07 17:07 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

>>> On 07.11.12 at 11:10, <Philippe.Simonet@swisscom.com> wrote:
> i compiled  a patched hypervisor for Mauro, it is running since many days 
> and the overflow occured, 
> without clock jumps
> 
>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>> tsc_stamp=e3839fcb0273
i.e. we have (in order of time)

 plt_wrap=5ece12d09306
      now=5ece12d16292
  plt_now=5ece12d34128

which is exactly the inverse order of how things should be (now not
necessarily being in the middle). Nor should plt_now and plt_wrap be
that close together. So far I have no idea how this can be explained.

Jan

Keir Fraser

2012-Nov-07 17:40 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk> wrote:
>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>> tsc_stamp=e3839fcb0273
>> 
>> (below is the complete xm dmesg output)
>> 
>> did that help you ? do you need more info ?
> 
> I''ll leave this to Keir (who wrote the debugging patch) to answer
but it
> looks to me like it should be useful!
I''m scratching my head. plt_wrap is earlier than plt_now, which should
be
impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That
seems very very improbable!

I wonder whether the overflow handling should just be removed, or made
conditional on a command-line parameter, or on the 32-bit platform counter
being at least somewhat likely to overflow before a softirq occurs -- it
seems lots of systems are using 14MHz HPET, and that gives us a couple of
minutes for the plt_overflow softirq to do its work before overflow occurs.
I think we would notice that outage in other ways. :)

 -- Keir

Jan Beulich

2012-Nov-08 09:39 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

>>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote:
> On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk>
wrote:
> 
>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
plt_wrap=5ece12d09306
>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>>> tsc_stamp=e3839fcb0273
>>> 
>>> (below is the complete xm dmesg output)
>>> 
>>> did that help you ? do you need more info ?
>> 
>> I''ll leave this to Keir (who wrote the debugging patch) to
answer but it
>> looks to me like it should be useful!
> 
> I''m scratching my head. plt_wrap is earlier than plt_now, which
should be
> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That
> seems very very improbable!
Is it? My understanding was that plt_stamp64 is just a software
extension to the more narrow HW counter, and hence the low
plt_mask bits would always be expected to be identical.

The plt_wrap < plt_now thing of course is entirely unexplainable
to me too: Considering that plt_scale doesn''t change at all post-
boot, apart from memory corruption I could only see an memory
access ordering problem to be the reason (platform_timer_stamp
and/or stime_platform_stamp changing despite platform_timer_lock
being held. So maybe taking a snapshot of all three static values
involved in the calculation in __read_platform_stime() between
acquiring the lock and the first call to __read_platform_stime(),
and printing them together with the "live" values in a second
printk() after the one your original patch added could rule that
out.

But the box doesn''t even seem to be NUMA (of course it also
doesn''t help that the log level was kept restricted - hint, hint,
Philippe), not does there appear to be any S3 cycle or pCPU
bring-up/-down in between...

Philippe, could you clarify again what CPU model(s) this is being
observed on (the long times between individual steps forward
with this problem perhaps warrant repeating the basics each
time, as it''s otherwise quite cumbersome to always look up old
pieces of information).
> I wonder whether the overflow handling should just be removed, or made
> conditional on a command-line parameter, or on the 32-bit platform counter
> being at least somewhat likely to overflow before a softirq occurs -- it
> seems lots of systems are using 14MHz HPET, and that gives us a couple of
> minutes for the plt_overflow softirq to do its work before overflow occurs.
> I think we would notice that outage in other ways. :)
Iirc we added this for a good reason - to cover the, however
unlikely, event of Xen running for very long without preemption.
Presumably most of the cases got fixed meanwhile, and indeed
a wraparound time on the order of minutes should make this
superfluous, but as the case here shows that code did spot a
severe anomaly (whatever that may turn out to be).

Also recall that there are HPET implementations around that tick
at a much higher frequency than 14MHz.

So unless we finally reach the understanding that the code is
flawed, I would rather want to keep it.

Jan

Keir Fraser

2012-Nov-08 10:38 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 08/11/2012 09:39, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
plt_wrap=5ece12d09306
>>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>>>> tsc_stamp=e3839fcb0273
>>>> 
>>>> (below is the complete xm dmesg output)
>>>> 
>>>> did that help you ? do you need more info ?
>>> 
>>> I''ll leave this to Keir (who wrote the debugging patch) to
answer but it
>>> looks to me like it should be useful!
>> 
>> I''m scratching my head. plt_wrap is earlier than plt_now,
which should be
>> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp.
That
>> seems very very improbable!
> 
> Is it? My understanding was that plt_stamp64 is just a software
> extension to the more narrow HW counter, and hence the low
> plt_mask bits would always be expected to be identical.
No, plt_stamp is simply the HW counter time at which plt_stamp64 was last
brought up to date. Hence plt_stamp64 is updated as:
 plt_stamp64 += (new_stamp - old_stamp) & plt_mask;

Hence why seeing plt_stamp64&plt_mask == new_stamp is very unexpected!

 -- Keir

Ian Campbell

2012-Nov-08 11:43 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On Wed, 2012-11-07 at 17:40 +0000, Keir Fraser wrote:> On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk>
wrote:
> 
> >>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
plt_wrap=5ece12d09306
> >>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> >>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> >>> tsc_stamp=e3839fcb0273
> >> 
> >> (below is the complete xm dmesg output)
> >> 
> >> did that help you ? do you need more info ?
> > 
> > I''ll leave this to Keir (who wrote the debugging patch) to
answer but it
> > looks to me like it should be useful!
> 
> I''m scratching my head. plt_wrap is earlier than plt_now, which
should be
> impossible.
impossible due to guarantees made by the h/w or by construction in Xen.
There appears to be a certain amount of hardware-specificness to the
issue -- so I''m wondering if maybe there are some platforms whose tsc
is
not as monotonically increasing as it needs to be...
>  plt_stamp64 oddly has low 32 bits identical to new_stamp. That
> seems very very improbable!
Does this code run on all cpus or just one? Is it always the same one?
> I wonder whether the overflow handling should just be removed, or made
> conditional on a command-line parameter, or on the 32-bit platform counter
> being at least somewhat likely to overflow before a softirq occurs -- it
> seems lots of systems are using 14MHz HPET, and that gives us a couple of
> minutes for the plt_overflow softirq to do its work before overflow occurs.
> I think we would notice that outage in other ways. :)
> 
>  -- Keir
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Keir Fraser

2012-Nov-08 12:54 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 08/11/2012 11:43, "Ian Campbell" <ijc@hellion.org.uk> wrote:
>>> I''ll leave this to Keir (who wrote the debugging patch) to
answer but it
>>> looks to me like it should be useful!
>> 
>> I''m scratching my head. plt_wrap is earlier than plt_now,
which should be
>> impossible.
> 
> impossible due to guarantees made by the h/w or by construction in Xen.
That''s a question, right? By construction in Xen.
> There appears to be a certain amount of hardware-specificness to the
> issue -- so I''m wondering if maybe there are some platforms whose
tsc is
> not as monotonically increasing as it needs to be...
plt_* timestamps are not derived from TSC at all.
>>  plt_stamp64 oddly has low 32 bits identical to new_stamp. That
>> seems very very improbable!
> 
> Does this code run on all cpus or just one? Is it always the same one?
Always cpu0.

 -- Keir

<Philippe.Simonet@swisscom.com>

2012-Nov-08 13:47 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Hi Mauro, 

that''s a question for you : 
> Philippe, could you clarify again what CPU model(s) this is being observed
on
> (the long times between individual steps forward with this problem perhaps
> warrant repeating the basics each time, as it''s otherwise quite
cumbersome
> to always look up old pieces of information).
can you provide this information ? 
	cat /proc/cpuinfo 	
	cat /proc/meminfo
	hardware information (manufacturer, model, urls, ...)

Thanks, Philippe

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, November 08, 2012 10:40 AM
> To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser
> Cc: 599161@bugs.debian.org; mrsanna1@gmail.com; Ian Campbell; xen-
> devel@lists.xen.org
> Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock
shifts by 50
> minutes" bug.
> 
> >>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org> wrote:
> > On 07/11/2012 13:22, "Ian Campbell"
<ijc@hellion.org.uk> wrote:
> >
> >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
> plt_wrap=5ece12d09306
> >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> >>>> tsc_stamp=e3839fcb0273
> >>>
> >>> (below is the complete xm dmesg output)
> >>>
> >>> did that help you ? do you need more info ?
> >>
> >> I''ll leave this to Keir (who wrote the debugging patch)
to answer but
> >> it looks to me like it should be useful!
> >
> > I''m scratching my head. plt_wrap is earlier than plt_now,
which should
> > be impossible. plt_stamp64 oddly has low 32 bits identical to
> > new_stamp. That seems very very improbable!
> 
> Is it? My understanding was that plt_stamp64 is just a software extension
to
> the more narrow HW counter, and hence the low plt_mask bits would always
> be expected to be identical.
> 
> The plt_wrap < plt_now thing of course is entirely unexplainable to me
too:
> Considering that plt_scale doesn''t change at all post- boot, apart
from
> memory corruption I could only see an memory access ordering problem to
> be the reason (platform_timer_stamp and/or stime_platform_stamp
> changing despite platform_timer_lock being held. So maybe taking a
> snapshot of all three static values involved in the calculation in
> __read_platform_stime() between acquiring the lock and the first call to
> __read_platform_stime(), and printing them together with the
"live" values
> in a second
> printk() after the one your original patch added could rule that out.
> 
> But the box doesn''t even seem to be NUMA (of course it also
doesn''t help
> that the log level was kept restricted - hint, hint, Philippe), not does
there
> appear to be any S3 cycle or pCPU bring-up/-down in between...
> 
> Philippe, could you clarify again what CPU model(s) this is being observed
on
> (the long times between individual steps forward with this problem perhaps
> warrant repeating the basics each time, as it''s otherwise quite
cumbersome
> to always look up old pieces of information).
> 
> > I wonder whether the overflow handling should just be removed, or made
> > conditional on a command-line parameter, or on the 32-bit platform
> > counter being at least somewhat likely to overflow before a softirq
> > occurs -- it seems lots of systems are using 14MHz HPET, and that
> > gives us a couple of minutes for the plt_overflow softirq to do its
work
> before overflow occurs.
> > I think we would notice that outage in other ways. :)
> 
> Iirc we added this for a good reason - to cover the, however unlikely,
event
> of Xen running for very long without preemption.
> Presumably most of the cases got fixed meanwhile, and indeed a
> wraparound time on the order of minutes should make this superfluous, but
> as the case here shows that code did spot a severe anomaly (whatever that
> may turn out to be).
> 
> Also recall that there are HPET implementations around that tick at a much
> higher frequency than 14MHz.
> 
> So unless we finally reach the understanding that the code is flawed, I
would
> rather want to keep it.
> 
> Jan

Jan Beulich

2012-Nov-08 13:53 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

>>> On 08.11.12 at 11:38, Keir Fraser <keir@xen.org> wrote:
> On 08/11/2012 09:39, "Jan Beulich" <JBeulich@suse.com>
wrote:
> 
>>>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
plt_wrap=5ece12d09306
>>>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff
tsc=e3839fd23854
>>>>>> tsc_stamp=e3839fcb0273
>>>>> 
>>>>> (below is the complete xm dmesg output)
>>>>> 
>>>>> did that help you ? do you need more info ?
>>>> 
>>>> I''ll leave this to Keir (who wrote the debugging
patch) to answer but it
>>>> looks to me like it should be useful!
>>> 
>>> I''m scratching my head. plt_wrap is earlier than plt_now,
which should be
>>> impossible. plt_stamp64 oddly has low 32 bits identical to
new_stamp. That
>>> seems very very improbable!
>> 
>> Is it? My understanding was that plt_stamp64 is just a software
>> extension to the more narrow HW counter, and hence the low
>> plt_mask bits would always be expected to be identical.
> 
> No, plt_stamp is simply the HW counter time at which plt_stamp64 was last
> brought up to date. Hence plt_stamp64 is updated as:
>  plt_stamp64 += (new_stamp - old_stamp) & plt_mask;
I concur: Given that what old_stamp is here was new_stamp for
the last update, we should simply have

stamp64 = s0 + (s1 - s0) + (s2 - s1) + ...

(of course with the mask applied on each addend), which (for the
low bits) is the same as just new_stamp.

Jan

Keir Fraser

2012-Nov-08 14:04 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 08/11/2012 13:53, "Jan Beulich" <JBeulich@suse.com> wrote:
>>> Is it? My understanding was that plt_stamp64 is just a software
>>> extension to the more narrow HW counter, and hence the low
>>> plt_mask bits would always be expected to be identical.
>> 
>> No, plt_stamp is simply the HW counter time at which plt_stamp64 was
last
>> brought up to date. Hence plt_stamp64 is updated as:
>>  plt_stamp64 += (new_stamp - old_stamp) & plt_mask;
> 
> I concur
Well, no, you don''t really. You''re about to point out the flaw
in my
reasoning...
> : Given that what old_stamp is here was new_stamp for
> the last update, we should simply have
> 
> stamp64 = s0 + (s1 - s0) + (s2 - s1) + ...
> 
> (of course with the mask applied on each addend), which (for the
> low bits) is the same as just new_stamp.
Very good point. Silly me. Then the observed value of plt_stamp64 makes
perfect sense.

 -- Keir

Keir Fraser

2012-Nov-08 14:20 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 07/11/2012 17:40, "Keir Fraser" <keir@xen.org> wrote:
> On 07/11/2012 13:22, "Ian Campbell" <ijc@hellion.org.uk>
wrote:
> 
>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
plt_wrap=5ece12d09306
>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>>> tsc_stamp=e3839fcb0273
>>> 
>>> (below is the complete xm dmesg output)
>>> 
>>> did that help you ? do you need more info ?
>> 
>> I''ll leave this to Keir (who wrote the debugging patch) to
answer but it
>> looks to me like it should be useful!
> 
> I''m scratching my head. plt_wrap is earlier than plt_now, which
should be
> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That
> seems very very improbable!
Jan has pointed out that the value of plt_stamp64 makes perfect sense, and
will in fact always have low 32 bits identical to new_stamp. At least that
is explained.

So, the question is then why plt_now (== __read_platform_stime(15b800366a5))
is greater than plt_wrap (== __read_platform_stime(15c800366a5)). Perhaps
the scale_delta() logic is failing for some reason, but we do use it a lot
elsewhere!

 -- Keir
> I wonder whether the overflow handling should just be removed, or made
> conditional on a command-line parameter, or on the 32-bit platform counter
> being at least somewhat likely to overflow before a softirq occurs -- it
> seems lots of systems are using 14MHz HPET, and that gives us a couple of
> minutes for the plt_overflow softirq to do its work before overflow occurs.
> I think we would notice that outage in other ways. :)
> 
>  -- Keir
> 
>

Ian Campbell

2012-Nov-08 14:28 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On Thu, 2012-11-08 at 12:54 +0000, Keir Fraser wrote:> On 08/11/2012 11:43, "Ian Campbell" <ijc@hellion.org.uk>
wrote:
> 
> >>> I''ll leave this to Keir (who wrote the debugging
patch) to answer but it
> >>> looks to me like it should be useful!
> >> 
> >> I''m scratching my head. plt_wrap is earlier than plt_now,
which should be
> >> impossible.
> > 
> > impossible due to guarantees made by the h/w or by construction in
Xen.
> 
> That''s a question, right? 
Yes, sorry.
> By construction in Xen.
> 
> > There appears to be a certain amount of hardware-specificness to the
> > issue -- so I''m wondering if maybe there are some platforms
whose tsc is
> > not as monotonically increasing as it needs to be...
> 
> plt_* timestamps are not derived from TSC at all.
I see, rather it is derived from the platform_timesource which could be
HPET, pmtimer, pit etc but in this case (according to the provided xm
dmesg) appears to be a 14MHz HPET.

So I guess s/tsc/HPET/ in my original thought...
> >>  plt_stamp64 oddly has low 32 bits identical to new_stamp. That
> >> seems very very improbable!
> > 
> > Does this code run on all cpus or just one? Is it always the same one?
> 
> Always cpu0.
So it''s not cross cpu drift then.

Ian.

Ian Campbell

2012-Nov-08 14:29 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On Thu, 2012-11-08 at 13:47 +0000, Philippe.Simonet@swisscom.com
wrote:> Hi Mauro, 
> 
> that''s a question for you : 
I think Jan was asking for information relating to the system you saw
this on -- or are you working on the same systems as Mauro?

Of course additional information from Mauro would be useful too in order
to help spotting any patterns.
> > Philippe, could you clarify again what CPU model(s) this is being
observed on
> > (the long times between individual steps forward with this problem
perhaps
> > warrant repeating the basics each time, as it''s otherwise
quite cumbersome
> > to always look up old pieces of information).
> 
> can you provide this information ? 
> 	cat /proc/cpuinfo 	
> 	cat /proc/meminfo
> 	hardware information (manufacturer, model, urls, ...)
> 
> Thanks, Philippe
> 
> 
> > -----Original Message-----
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Thursday, November 08, 2012 10:40 AM
> > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser
> > Cc: 599161@bugs.debian.org; mrsanna1@gmail.com; Ian Campbell; xen-
> > devel@lists.xen.org
> > Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock
shifts by 50
> > minutes" bug.
> > 
> > >>> On 07.11.12 at 18:40, Keir Fraser <keir@xen.org>
wrote:
> > > On 07/11/2012 13:22, "Ian Campbell"
<ijc@hellion.org.uk> wrote:
> > >
> > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
> > plt_wrap=5ece12d09306
> > >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> > >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff
tsc=e3839fd23854
> > >>>> tsc_stamp=e3839fcb0273
> > >>>
> > >>> (below is the complete xm dmesg output)
> > >>>
> > >>> did that help you ? do you need more info ?
> > >>
> > >> I''ll leave this to Keir (who wrote the debugging
patch) to answer but
> > >> it looks to me like it should be useful!
> > >
> > > I''m scratching my head. plt_wrap is earlier than
plt_now, which should
> > > be impossible. plt_stamp64 oddly has low 32 bits identical to
> > > new_stamp. That seems very very improbable!
> > 
> > Is it? My understanding was that plt_stamp64 is just a software
extension to
> > the more narrow HW counter, and hence the low plt_mask bits would
always
> > be expected to be identical.
> > 
> > The plt_wrap < plt_now thing of course is entirely unexplainable to
me too:
> > Considering that plt_scale doesn''t change at all post- boot,
apart from
> > memory corruption I could only see an memory access ordering problem
to
> > be the reason (platform_timer_stamp and/or stime_platform_stamp
> > changing despite platform_timer_lock being held. So maybe taking a
> > snapshot of all three static values involved in the calculation in
> > __read_platform_stime() between acquiring the lock and the first call
to
> > __read_platform_stime(), and printing them together with the
"live" values
> > in a second
> > printk() after the one your original patch added could rule that out.
> > 
> > But the box doesn''t even seem to be NUMA (of course it also
doesn''t help
> > that the log level was kept restricted - hint, hint, Philippe), not
does there
> > appear to be any S3 cycle or pCPU bring-up/-down in between...
> > 
> > Philippe, could you clarify again what CPU model(s) this is being
observed on
> > (the long times between individual steps forward with this problem
perhaps
> > warrant repeating the basics each time, as it''s otherwise
quite cumbersome
> > to always look up old pieces of information).
> > 
> > > I wonder whether the overflow handling should just be removed, or
made
> > > conditional on a command-line parameter, or on the 32-bit
platform
> > > counter being at least somewhat likely to overflow before a
softirq
> > > occurs -- it seems lots of systems are using 14MHz HPET, and that
> > > gives us a couple of minutes for the plt_overflow softirq to do
its work
> > before overflow occurs.
> > > I think we would notice that outage in other ways. :)
> > 
> > Iirc we added this for a good reason - to cover the, however unlikely,
event
> > of Xen running for very long without preemption.
> > Presumably most of the cases got fixed meanwhile, and indeed a
> > wraparound time on the order of minutes should make this superfluous,
but
> > as the case here shows that code did spot a severe anomaly (whatever
that
> > may turn out to be).
> > 
> > Also recall that there are HPET implementations around that tick at a
much
> > higher frequency than 14MHz.
> > 
> > So unless we finally reach the understanding that the code is flawed,
I would
> > rather want to keep it.
> > 
> > Jan
> 
>

Keir Fraser

2012-Nov-08 14:34 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 08/11/2012 14:28, "Ian Campbell" <ijc@hellion.org.uk> wrote:
>>> There appears to be a certain amount of hardware-specificness to
the
>>> issue -- so I''m wondering if maybe there are some
platforms whose tsc is
>>> not as monotonically increasing as it needs to be...
>> 
>> plt_* timestamps are not derived from TSC at all.
> 
> I see, rather it is derived from the platform_timesource which could be
> HPET, pmtimer, pit etc but in this case (according to the provided xm
> dmesg) appears to be a 14MHz HPET.
> 
> So I guess s/tsc/HPET/ in my original thought...
In fact plt_now and plt_wrap are both derived from the same value of
plt_stamp64. One is derived from it directly, and the other from
plt_stamp64+plt_mask+1 (== plt_stamp64+(1<<32)).

 -- Keir

Mauro

2012-Nov-08 15:21 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 8 November 2012 14:47,  <Philippe.Simonet@swisscom.com>
wrote:> Hi Mauro,
>
> that''s a question for you :
>
>> Philippe, could you clarify again what CPU model(s) this is being
observed on
>> (the long times between individual steps forward with this problem
perhaps
>> warrant repeating the basics each time, as it''s otherwise
quite cumbersome
>> to always look up old pieces of information).
>
> can you provide this information ?
>         cat /proc/cpuinfo
>         cat /proc/meminfo
>         hardware information (manufacturer, model, urls, ...)
>
cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 4
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 6
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 8
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 9
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 10
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 11
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 12
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 13
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 14
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
stepping        : 11
cpu MHz         : 2400.128
cache size      : 3072 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mtrr mca cmov
pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc
rep_good aperfmperf pni est ssse3 cx16 hypervisor lahf_lm
bogomips        : 4800.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:


cat /proc/meminfo
MemTotal:        3127132 kB
MemFree:         2697364 kB
Buffers:           65816 kB
Cached:            62188 kB
SwapCached:            0 kB
Active:           117156 kB
Inactive:          46840 kB
Active(anon):      28668 kB
Inactive(anon):    16360 kB
Active(file):      88488 kB
Inactive(file):    30480 kB
Unevictable:       18944 kB
Mlocked:           18944 kB
SwapTotal:        974840 kB
SwapFree:         974840 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         54936 kB
Mapped:            14124 kB
Shmem:               508 kB
Slab:              66072 kB
SReclaimable:      19732 kB
SUnreclaim:        46340 kB
KernelStack:        4016 kB
PageTables:         4216 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2538404 kB
Committed_AS:     202468 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      289904 kB
VmallocChunk:   34359384628 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     3145728 kB
DirectMap2M:           0 kB


note that 3G is the mem reserved to dom0, the total amount of mem in
the machine is 64G.


The machine is a HP Proliant DL 580 G5,
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01154573&lang=en&cc=us&taskId=&prodSeriesId=3454575&prodTypeId=15351

Thanks to all for the work.

Tim Deegan

2012-Nov-08 16:45 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

At 09:39 +0000 on 08 Nov (1352367592), Jan Beulich
wrote:> The plt_wrap < plt_now thing of course is entirely unexplainable
> to me too: Considering that plt_scale doesn''t change at all post-
> boot, apart from memory corruption I could only see an memory
> access ordering problem to be the reason (platform_timer_stamp
> and/or stime_platform_stamp changing despite platform_timer_lock
> being held. So maybe taking a snapshot of all three static values
> involved in the calculation in __read_platform_stime() between
> acquiring the lock and the first call to __read_platform_stime(),
> and printing them together with the "live" values in a second
> printk() after the one your original patch added could rule that
> out.
>  
> But the box doesn''t even seem to be NUMA (of course it also
> doesn''t help that the log level was kept restricted - hint, hint,
> Philippe), not does there appear to be any S3 cycle or pCPU
> bring-up/-down in between...
S3 looks like it might be a culprit, since resume_platform_timer()
clobbers plt_stamp64 without taking the platform_timer_lock.  But both
the S3 resume code and the plt_overflow timer should only ever run on
CPU 0, so even that should be safe (unless continue_hypercall_on_cpu()
is broken...)

Definitely having loglvl=all would have helped here, to eliminate S3
from our enquiries.
> > I wonder whether the overflow handling should just be removed, or made
> > conditional on a command-line parameter, or on the 32-bit platform
counter
> > being at least somewhat likely to overflow before a softirq occurs --
it
> > seems lots of systems are using 14MHz HPET, and that gives us a couple
of
> > minutes for the plt_overflow softirq to do its work before overflow
occurs.
> > I think we would notice that outage in other ways. :)
> 
> Iirc we added this for a good reason - to cover the, however
> unlikely, event of Xen running for very long without preemption.
> Presumably most of the cases got fixed meanwhile, and indeed
> a wraparound time on the order of minutes should make this
> superfluous, but as the case here shows that code did spot a
> severe anomaly (whatever that may turn out to be).
ISTR when this code went in we were dealing with a timer that had a
period of about 4 seconds (ACPI PMTIMER?).  It might well be OTT for the
HPET, but if there''s something weird going on I''d like to
track it down
while we have some sort of a handle on it.

Tim.

Keir Fraser

2012-Nov-08 16:53 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 08/11/2012 16:45, "Tim Deegan" <tim@xen.org> wrote:
>>> I wonder whether the overflow handling should just be removed, or
made
>>> conditional on a command-line parameter, or on the 32-bit platform
counter
>>> being at least somewhat likely to overflow before a softirq occurs
-- it
>>> seems lots of systems are using 14MHz HPET, and that gives us a
couple of
>>> minutes for the plt_overflow softirq to do its work before overflow
occurs.
>>> I think we would notice that outage in other ways. :)
>> 
>> Iirc we added this for a good reason - to cover the, however
>> unlikely, event of Xen running for very long without preemption.
>> Presumably most of the cases got fixed meanwhile, and indeed
>> a wraparound time on the order of minutes should make this
>> superfluous, but as the case here shows that code did spot a
>> severe anomaly (whatever that may turn out to be).
> 
> ISTR when this code went in we were dealing with a timer that had a
> period of about 4 seconds (ACPI PMTIMER?).  It might well be OTT for the
> HPET, but if there''s something weird going on I''d like to
track it down
> while we have some sort of a handle on it.
It must have been the PMTIMER. It''s the only counter narrower than 32
bits
(legacy PIT we simulate as a 32-bit counter behind the scenes).

 -- Keir

<Philippe.Simonet@swisscom.com>

2012-Nov-09 09:05 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

> -----Original Message-----
> From: Ian Campbell [mailto:ijc@hellion.org.uk]
> Sent: Thursday, November 08, 2012 3:29 PM
> To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE
> Cc: mrsanna1@gmail.com; 599161@bugs.debian.org; xen-
> devel@lists.xen.org; keir@xen.org; JBeulich@suse.com
> Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock
shifts by 50 minutes" bug.
> 
> 
> I think Jan was asking for information relating to the system you saw this
on -
> - or are you working on the same systems as Mauro?
oops, excuse me, here is a description : I have the problem on 4 systems, all
with same hardware.
the problem occured  on each system, 1 time each 2 month in average. since
January 2012, I decided to reboot them all monthly,
and the clock jump occurred only once in February ...

SYSTEM : 		HP ProLiant DL385 G7, with 2 * AMD Processor 6174 (12 cores) = 24
cores, 16 GB MEMORY
XEN    			 (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4) (ultrotter@debian.org)
(gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep  8 19:15:46 UTC 2012
DOM0			Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23 13:49:30 UTC 2012 x86_64
GNU/Linux
CPU 			
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 9
model name      : AMD Opteron(tm) Processor 6174
stepping        : 1
cpu MHz         : 3791872.477
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic mtrr mca cmov pat clflush mmx
fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow constant_tsc
rep_good nonstop_tsc extd_apicid amd_dcm pni cx16 popcnt hypervisor lahf_lm
cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch nodeid_msr
bogomips        : 4400.17
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
> 
> Of course additional information from Mauro would be useful too in order to
> help spotting any patterns.
> 
> > > Philippe, could you clarify again what CPU model(s) this is being
> > > observed on (the long times between individual steps forward with
> > > this problem perhaps warrant repeating the basics each time, as
it''s
> > > otherwise quite cumbersome to always look up old pieces of
> information).
> >
> > can you provide this information ?
> > 	cat /proc/cpuinfo
> > 	cat /proc/meminfo
> > 	hardware information (manufacturer, model, urls, ...)
> >
> > Thanks, Philippe
> >
> >
> > > -----Original Message-----
> > > From: Jan Beulich [mailto:JBeulich@suse.com]
> > > Sent: Thursday, November 08, 2012 10:40 AM
> > > To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser
> > > Cc: 599161@bugs.debian.org; mrsanna1@gmail.com; Ian Campbell;
xen-
> > > devel@lists.xen.org
> > > Subject: Re: [Xen-devel] #599161: Xen debug patch for the
"clock
> > > shifts by 50 minutes" bug.
> > >
> > > >>> On 07.11.12 at 18:40, Keir Fraser
<keir@xen.org> wrote:
> > > > On 07/11/2012 13:22, "Ian Campbell"
<ijc@hellion.org.uk> wrote:
> > > >
> > > >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128
> > > plt_wrap=5ece12d09306
> > > >>>> now=5ece12d16292 old_stamp=35c7c
new_stamp=800366a5
> > > >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff
tsc=e3839fd23854
> > > >>>> tsc_stamp=e3839fcb0273
> > > >>>
> > > >>> (below is the complete xm dmesg output)
> > > >>>
> > > >>> did that help you ? do you need more info ?
> > > >>
> > > >> I''ll leave this to Keir (who wrote the
debugging patch) to answer
> > > >> but it looks to me like it should be useful!
> > > >
> > > > I''m scratching my head. plt_wrap is earlier than
plt_now, which
> > > > should be impossible. plt_stamp64 oddly has low 32 bits
identical
> > > > to new_stamp. That seems very very improbable!
> > >
> > > Is it? My understanding was that plt_stamp64 is just a software
> > > extension to the more narrow HW counter, and hence the low
plt_mask
> > > bits would always be expected to be identical.
> > >
> > > The plt_wrap < plt_now thing of course is entirely
unexplainable to me
> too:
> > > Considering that plt_scale doesn''t change at all post-
boot, apart
> > > from memory corruption I could only see an memory access ordering
> > > problem to be the reason (platform_timer_stamp and/or
> > > stime_platform_stamp changing despite platform_timer_lock being
> > > held. So maybe taking a snapshot of all three static values
involved
> > > in the calculation in
> > > __read_platform_stime() between acquiring the lock and the first
> > > call to __read_platform_stime(), and printing them together with
the
> > > "live" values in a second
> > > printk() after the one your original patch added could rule that
out.
> > >
> > > But the box doesn''t even seem to be NUMA (of course it
also doesn''t
> > > help that the log level was kept restricted - hint, hint,
Philippe),
> > > not does there appear to be any S3 cycle or pCPU bring-up/-down
in
> between...
> > >
> > > Philippe, could you clarify again what CPU model(s) this is being
> > > observed on (the long times between individual steps forward with
> > > this problem perhaps warrant repeating the basics each time, as
it''s
> > > otherwise quite cumbersome to always look up old pieces of
> information).
> > >
> > > > I wonder whether the overflow handling should just be
removed, or
> > > > made conditional on a command-line parameter, or on the
32-bit
> > > > platform counter being at least somewhat likely to overflow
before
> > > > a softirq occurs -- it seems lots of systems are using 14MHz
HPET,
> > > > and that gives us a couple of minutes for the plt_overflow
softirq
> > > > to do its work
> > > before overflow occurs.
> > > > I think we would notice that outage in other ways. :)
> > >
> > > Iirc we added this for a good reason - to cover the, however
> > > unlikely, event of Xen running for very long without preemption.
> > > Presumably most of the cases got fixed meanwhile, and indeed a
> > > wraparound time on the order of minutes should make this
> > > superfluous, but as the case here shows that code did spot a
severe
> > > anomaly (whatever that may turn out to be).
> > >
> > > Also recall that there are HPET implementations around that tick
at
> > > a much higher frequency than 14MHz.
> > >
> > > So unless we finally reach the understanding that the code is
> > > flawed, I would rather want to keep it.
> > >
> > > Jan
> >
> >
>

Jan Beulich

2012-Nov-09 09:47 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

>>> On 09.11.12 at 10:05, <Philippe.Simonet@swisscom.com> wrote:
> oops, excuse me, here is a description : I have the problem on 4 systems, 
> all with same hardware.
> the problem occured  on each system, 1 time each 2 month in average. since 
> January 2012, I decided to reboot them all monthly, 
> and the clock jump occurred only once in February ...
> 
> SYSTEM : 		HP ProLiant DL385 G7, with 2 * AMD Processor 6174 (12 cores) =
24
> cores, 16 GB MEMORY
> XEN    			 (XEN) Xen version 4.0.1 (Debian 4.0.1-5.4)
(ultrotter@debian.org)
> (gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep  8 19:15:46 UTC 2012
> DOM0			Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23 13:49:30 UTC 2012 x86_64 
> GNU/Linux
> CPU 			
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 16
> model           : 9
> model name      : AMD Opteron(tm) Processor 6174
Huh - so we have the problem on even different vendor CPUs (as
Mauro''s are Intel ones). But I take it that you haven''t see an
event
yet with the debugging patch?

Plus, what''s puzzling me a little too - before the occurrence of the
event on Mauro''s system, I was under the impression that this
requires quite a bit of uptime. Yet the event he observed occurred
early on the second day after boot afaict.

Jan

Mauro

2012-Nov-09 09:54 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 9 November 2012 10:47, Jan Beulich <JBeulich@suse.com>
wrote:>>>> On 09.11.12 at 10:05, <Philippe.Simonet@swisscom.com>
wrote:
>> oops, excuse me, here is a description : I have the problem on 4
systems,
>> all with same hardware.
>> the problem occured  on each system, 1 time each 2 month in average.
since
>> January 2012, I decided to reboot them all monthly,
>> and the clock jump occurred only once in February ...
>>
>> SYSTEM :              HP ProLiant DL385 G7, with 2 * AMD Processor 6174
(12 cores) = 24
>> cores, 16 GB MEMORY
>> XEN                            (XEN) Xen version 4.0.1 (Debian
4.0.1-5.4) (ultrotter@debian.org)
>> (gcc version 4.4.5 (Debian 4.4.5-8) ) Sat Sep  8 19:15:46 UTC 2012
>> DOM0                  Linux 2.6.32-5-xen-amd64 #1 SMP Sun Sep 23
13:49:30 UTC 2012 x86_64
>> GNU/Linux
>> CPU
>> processor       : 0
>> vendor_id       : AuthenticAMD
>> cpu family      : 16
>> model           : 9
>> model name      : AMD Opteron(tm) Processor 6174
>
> Huh - so we have the problem on even different vendor CPUs (as
> Mauro''s are Intel ones). But I take it that you haven''t
see an event
> yet with the debugging patch?
>
> Plus, what''s puzzling me a little too - before the occurrence of
the
> event on Mauro''s system, I was under the impression that this
> requires quite a bit of uptime. Yet the event he observed occurred
> early on the second day after boot afaict.
Before the patch clock jumps on my systems occurred about once or twice at week.

Jan Beulich

2012-Nov-13 10:14 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

>>> On 09.11.12 at 10:05, <Philippe.Simonet@swisscom.com> wrote:
Since it looks like this got stalled again, attached is a slightly
extended version of Keir''s debugging patch, allowing to rule out
any inconsistencies of the globals between the first and second
instances of the two invocations of __read_platform_stime().

Should the numbers printed turn out identical between the two
invocations and identical to the boot time determined values, then
I''m afraid I''m out of explanations as well as debugging
suggestions.

Please remember to add "loglvl=all" to the hypervisor command
line.

The patch is against a 4.0.3 based tree I had still lying around, so
I hope it''ll apply cleanly to your 4.0.1 based one.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Jan Beulich

2012-Nov-26 10:20 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote:
> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3
> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82
> tsc_stamp=cfdf69ccc91d
> (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3
> (XEN) XXX ps=7:8baebcb8/7:8baebcb8
I.e. plt_wrap == stime_platform_stamp, which would imply
plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which
visibly isn''t the case.

Having to consider mis-compilation of some sort at this point,
can you please attach the xen-syms matching the xen.gz from
the run above?

The only other alternative would appear to be register or stack
corruption (which of the two might also be visible from the
xen-syms, so it would be needed for that determination too). As
the code in question runs with interrupts disabled, the only
sources for that would be NMI, MCE, or SMI. Adding logging of
instances of the former two would be possible, but otoh the
entry/exit code of them is uniform with other exceptions and
hence can pretty much be excluded to corrupt anything.

Jan

Mauro

2012-Nov-26 10:28 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com>
wrote:>>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com> wrote:
>> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3
>> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
>> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82
>> tsc_stamp=cfdf69ccc91d
>> (XEN) XXX pts=13d80036988/13d80036988 sps=569ed0bd2ac3/569ed0bd2ac3
>> (XEN) XXX ps=7:8baebcb8/7:8baebcb8
>
> I.e. plt_wrap == stime_platform_stamp, which would imply
> plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which
> visibly isn''t the case.
>
> Having to consider mis-compilation of some sort at this point,
> can you please attach the xen-syms matching the xen.gz from
> the run above?
how can I do?

Ian Campbell

2012-Nov-26 10:40 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

On Mon, 2012-11-26 at 11:28 +0100, Mauro wrote:> On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote:
> >>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com>
wrote:
> >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3 plt_wrap=569ed0bd2ac3
> >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
> >> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82
> >> tsc_stamp=cfdf69ccc91d
> >> (XEN) XXX pts=13d80036988/13d80036988
sps=569ed0bd2ac3/569ed0bd2ac3
> >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8
> >
> > I.e. plt_wrap == stime_platform_stamp, which would imply
> > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which
> > visibly isn''t the case.
> >
> > Having to consider mis-compilation of some sort at this point,
> > can you please attach the xen-syms matching the xen.gz from
> > the run above?
> 
> how can I do?
xen/xen-syms should be in the place where you built the hypervisor.

-- 
Ian Campbell

It is not doing the thing we like to do, but liking the thing we have to do,
that makes life blessed.
		-- Goethe

Jan Beulich

2012-Nov-26 12:18 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

>>> On 26.11.12 at 11:40, Ian Campbell <ijc@hellion.org.uk>
wrote:
> On Mon, 2012-11-26 at 11:28 +0100, Mauro wrote:
>> On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com> wrote:
>> >>>> On 24.11.12 at 13:06, Mauro <mrsanna1@gmail.com>
wrote:
>> >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3
plt_wrap=569ed0bd2ac3
>> >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
>> >> plt_stamp64=13d800377ce plt_mask=ffffffff tsc=cfdf69d72c82
>> >> tsc_stamp=cfdf69ccc91d
>> >> (XEN) XXX pts=13d80036988/13d80036988
sps=569ed0bd2ac3/569ed0bd2ac3
>> >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8
>> >
>> > I.e. plt_wrap == stime_platform_stamp, which would imply
>> > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which
>> > visibly isn''t the case.
>> >
>> > Having to consider mis-compilation of some sort at this point,
>> > can you please attach the xen-syms matching the xen.gz from
>> > the run above?
>> 
>> how can I do?
> 
> xen/xen-syms should be in the place where you built the hypervisor.
... or, if you didn''t build the hypervisor yourself, you should check
the package you were handed.

Jan

<Philippe.Simonet@swisscom.com>

2012-Nov-26 13:08 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Hi jan

I''ve done that for Mauro, here are the symbols :
http://www.ncube.ch/xen-syms.gz

MANY thanks and regards

Philippe



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, November 26, 2012 1:18 PM
> To: Mauro
> Cc: Keir Fraser; Ian Campbell; xen-devel; Simonet Philippe, ITS-OUS-OP-
> IFM-NW-IPE
> Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock
shifts by 50
> minutes" bug.
> 
> >>> On 26.11.12 at 11:40, Ian Campbell <ijc@hellion.org.uk>
wrote:
> > On Mon, 2012-11-26 at 11:28 +0100, Mauro wrote:
> >> On 26 November 2012 11:20, Jan Beulich <JBeulich@suse.com>
wrote:
> >> >>>> On 24.11.12 at 13:06, Mauro
<mrsanna1@gmail.com> wrote:
> >> >> (XEN) XXX plt_overflow: plt_now=569ed0c10fa3
> plt_wrap=569ed0bd2ac3
> >> >> now=569ed0bf15a4 old_stamp=36f62 new_stamp=800377ce
> >> >> plt_stamp64=13d800377ce plt_mask=ffffffff
tsc=cfdf69d72c82
> >> >> tsc_stamp=cfdf69ccc91d
> >> >> (XEN) XXX pts=13d80036988/13d80036988
> >> >> sps=569ed0bd2ac3/569ed0bd2ac3
> >> >> (XEN) XXX ps=7:8baebcb8/7:8baebcb8
> >> >
> >> > I.e. plt_wrap == stime_platform_stamp, which would imply
> >> > plt_stamp64 + plt_mask + 1 == platform_timer_stamp, which
visibly
> >> > isn''t the case.
> >> >
> >> > Having to consider mis-compilation of some sort at this
point, can
> >> > you please attach the xen-syms matching the xen.gz from the
run
> >> > above?
> >>
> >> how can I do?
> >
> > xen/xen-syms should be in the place where you built the hypervisor.
> 
> ... or, if you didn''t build the hypervisor yourself, you should
check the
> package you were handed.
> 
> Jan

Sylvain Munaut

2012-Nov-26 14:15 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Hi,


I just wanted to say I''ve also been hitting this problem for years on
some servers.
That was originally under xen 4.0 when I originally installed the
machines, but even after update to 4.1.3 and more recently a 3.6.4
dom0 kernel, the problem is still present.

More specifically the two hw platforms I''m sure are concerned (hw +
cpu info + debug-key ''s'' info)

Sun Fire x2250:

    cpu family : 6
    model : 23
    model name : Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz
    stepping : 6
    microcode : 0x60b

(XEN) TSC has constant rate, deep Cstates possible, so not reliable,
warp=3292 (count=1)
(XEN) dom1: mode=0,ofs=0x56eb699774,khz=2500128,inc=1,vtsc count:
1168090 kernel, 18678 user


Sun Fire x4150:

    cpu family : 6
    model : 23
    model name : Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
    stepping : 10
    microcode : 0xa07

(XEN) TSC has constant rate, deep Cstates possible, so not reliable,
warp=3789 (count=1)
(XEN) dom2: mode=0,ofs=0x3a652aed7d,khz=2992564,inc=1,vtsc count:
4247396065 kernel, 712543 user
(XEN) dom3: mode=0,ofs=0x9eb4c44593,khz=2500147,inc=2,vtsc count:
1967559486 kernel, 192516 user


When the problem appears, I can also see the "(XEN) Platform timer
appears to have unexpectedly wrapped 10 or more times." message.

I have now deployed a freshly compiled 4.1.3 with the latest tsc debug
patched compiled and loglvl=all option to a test server with the same
hw where the problem happenned and I''ll report if I see anything.

The problem is pretty rare so not sure when it will happen next ...


Cheers,

     Sylvain

Jan Beulich

2012-Nov-26 14:29 UTC

head link

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

>>> On 26.11.12 at 14:08, <Philippe.Simonet@swisscom.com> wrote:
> I''ve done that for Mauro, here are the symbols : 
> http://www.ncube.ch/xen-syms.gz 
That helped; I''ll send a patch shortly.

Jan

Xen devel - Oct 2012 - #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

#599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

Re: #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.