thr3ads.net - freebsd stable - cpu timer issues [Sep 2010]

Jurgen Weber

2010-Sep-28 08:05 UTC

cpu timer issues

Hello List

We have been having issues with some firewall machines of ours using
pfSense.

FreeBSD smash01.ish.com.au 7.2-RELEASE-p5 FreeBSD 7.2-RELEASE-p5 #0: Sun
Dec 6 23:20:31 EST 2009
sullrich@FreeBSD_7.2_pfSense_1.2.3_snaps.pfsense.org:/usr/obj.pfSense/usr/pfSensesrc/src/sys/pfSense_SMP.7
i386

MotherBoard:
http://www.supermicro.com/products/motherboard/Xeon3000/3200/X7SBi-LN4.cfm

Originally the systems started out by showing a lot of packet loss, the
system time would fall behind, and the value of "vmstat -i | grep
timer" was dropping below 2000. I was led to believe by the guys at
pfSense that this is where the value should sit. I would also receive
errors in messages that looked like "kernel: calcru: runtime went
backwards from 244314 usec to 236341".

We tried a variety of things: disabling USB, turning off Intel SpeedStep
in the BIOS, disabling ACPI, etc., all having little to no effect. The
only thing that would right it was restarting the box, but over time it
would degrade again. I talked to SuperMicro and they said that this is a
FreeBSD issue and pretty much washed their hands of it.

After a couple of months of dealing with this and just rebooting the
systems regularly, the symptoms slowly but surely disappeared, e.g. the
kernel messages went away, the system time was not falling behind and I
was experiencing no packet loss, but the "vmstat -i | grep timer" value
would continue to decrease over time. Eventually, I think, when it
finally got to 0 the machine restarted (I am only guessing here).

After this restart it worked again for a couple of hours and then it
restarted again.

After the second restart the system has not missed a beat; it has been
fine and the "vmstat -i | grep timer" value remained near the 2000 mark.
We set up some Zabbix monitoring to watch it. As mentioned, it was fine
for about a month, until today. Today the value has dropped to 0, but
the system has not restarted, and over the last couple of hours the value
has increased to 47.

This machine is mission critical. We have two in a failover scenario
(using pfSense's CARP features) and it seems unfortunate that we have an
issue with two brand new SuperMicro boxes that affects both machines.
While at the moment everything seems fine, I want to ensure that I have
no further issues. Does anyone have any suggestions?

Lastly, I have double-checked both of the below:
http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#CALCRU-NEGATIVE-RUNTIME
We disabled EIST.

http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#COMPUTER-CLOCK-SKEW

# dmesg | grep Timecounter
Timecounter "i8254" frequency 1193182 Hz quality 0
Timecounters tick every 1.000 msec
# sysctl kern.timecounter.hardware
kern.timecounter.hardware: i8254

Only have one timer to choose from.

Thanks

Jurgen

Jeremy Chadwick

2010-Sep-28 09:31 UTC

cpu timer issues

On Tue, Sep 28, 2010 at 05:54:15PM +1000, Jurgen Weber wrote:
> Hello List
> 
> We have been having issues with some firewall machines of ours using
> pfSense.
> 
> FreeBSD smash01.ish.com.au 7.2-RELEASE-p5 FreeBSD 7.2-RELEASE-p5 #0:
> Sun Dec  6 23:20:31 EST 2009
> sullrich@FreeBSD_7.2_pfSense_1.2.3_snaps.pfsense.org:/usr/obj.pfSense/usr/pfSensesrc/src/sys/pfSense_SMP.7
> i386
> 
> MotherBoard:
> http://www.supermicro.com/products/motherboard/Xeon3000/3200/X7SBi-LN4.cfm
> 
> Originally the systems started out by showing a lot of packet loss,
> the system time would fall behind, and the value of "vmstat -i |
> grep timer" was dropping below 2000. I was led to believe by the
> guys at pfSense that this is where the value should sit. I would
> also receive errors in messages that looked like "kernel: calcru:
> runtime went backwards from 244314 usec to 236341".
> 
> We tried a variety of things: disabling USB, turning off Intel
> SpeedStep in the BIOS, disabling ACPI, etc., all having little to no
> effect. The only thing that would right it was restarting the box,
> but over time it would degrade again. I talked to SuperMicro and
> they said that this is a FreeBSD issue and pretty much washed their
> hands of it.
> 
> After a couple of months of dealing with this and just rebooting the
> systems regularly, the symptoms slowly but surely disappeared, e.g.
> the kernel messages went away, the system time was not falling
> behind and I was experiencing no packet loss, but the "vmstat -i |
> grep timer" value would continue to decrease over time. Eventually,
> I think, when it finally got to 0 the machine restarted (I am only
> guessing here).
> 
> After this restart it worked again for a couple of hours and then it
> restarted again.
> 
> After the second restart the system has not missed a beat; it has
> been fine and the "vmstat -i | grep timer" value remained near the
> 2000 mark. We set up some Zabbix monitoring to watch it. As
> mentioned, it was fine for about a month, until today. Today the
> value has dropped to 0, but the system has not restarted, and over
> the last couple of hours the value has increased to 47.
> 
> This machine is mission critical. We have two in a failover
> scenario (using pfSense's CARP features) and it seems unfortunate
> that we have an issue with two brand new SuperMicro boxes that
> affects both machines. While at the moment everything seems fine, I
> want to ensure that I have no further issues. Does anyone have any
> suggestions?
> 
> Lastly, I have double-checked both of the below:
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#CALCRU-NEGATIVE-RUNTIME
> We disabled EIST.
> 
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#COMPUTER-CLOCK-SKEW
> 
> # dmesg | grep Timecounter
> Timecounter "i8254" frequency 1193182 Hz quality 0
> Timecounters tick every 1.000 msec
> # sysctl kern.timecounter.hardware
> kern.timecounter.hardware: i8254
> 
> Only have one timer to choose from.

I have a subrevision of this motherboard in use in production, which ran
RELENG_7 and now runs RELENG_8, without any of the problems you
describe.  I don't have any experience with the -LN4 submodel, though I
do have experience with the X7SBA-LN4.

Our hardware in question:

http://www.supermicro.com/products/system/1U/5015/SYS-5015B-MT.cfm

The machine in question consists of 4 disks (1 OS, 3 ZFS raidz1), uses
both NICs (two separate networks) at gigE rates, handles nightly backups
for all other servers, acts as an NFS server, a time source (ntpd) for
other servers on the network, and a serial console head.  Oh, it also
has EIST enabled, and runs powerd with some minor (well-known) tunings
in loader.conf for it.

Secondly, here's our sysctl kern.timecounter tree on our system, in
addition to our SMBIOS details (proving the system is what I say it is).
Note that we have multiple timecounter choices, and ACPI-fast is chosen.
I would expect problems if i8254 was chosen, but the question is why
this is being chosen on your systems and why alternate timecounter
choices aren't available.  You said you tried booting with ACPI
disabled, which might explain why ACPI-fast or ACPI-safe are missing.

$ sysctl kern.timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(-100) ACPI-fast(1000) i8254(0) dummy(-1000000)
kern.timecounter.hardware: ACPI-fast
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 47135
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 188736
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 1000
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 2830682562
kern.timecounter.tc.TSC.frequency: 2333508681
kern.timecounter.tc.TSC.quality: -100
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1

$ kenv | grep smbios
smbios.bios.reldate="07/24/2009"
smbios.bios.vendor="Phoenix Technologies LTD"
smbios.bios.version="1.30      "
smbios.chassis.maker="Supermicro"
smbios.chassis.serial="0123456789"
smbios.chassis.tag=" "
smbios.chassis.version="0123456789"
smbios.memory.enabled="8388608"
smbios.planar.maker="Supermicro"
smbios.planar.product="X7SBi"
smbios.planar.serial="0123456789"
smbios.planar.version="PCB Version"
smbios.socket.enabled="1"
smbios.socket.populated="1"
smbios.system.maker="Supermicro"
smbios.system.product="X7SBi"
smbios.system.serial="0123456789"
smbios.system.uuid="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
smbios.system.version="0123456789"
smbios.version="2.5"
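
As a rough illustration of how to read the kern.timecounter output above:
the number in parentheses after each name in kern.timecounter.choice is
its quality, and the candidate with the highest quality is the one
reported as kern.timecounter.hardware here. The following is a minimal
Python sketch under that reading, not kernel code; the function names are
hypothetical, and the wrap-period helper assumes a counter simply rolls
over after (mask + 1) ticks.

```python
import re

# Sample value copied from the kern.timecounter.choice line above.
CHOICE = "TSC(-100) ACPI-fast(1000) i8254(0) dummy(-1000000)"


def parse_choices(choice_line):
    """Return {name: quality} parsed from a kern.timecounter.choice string."""
    pairs = re.findall(r"(\S+)\((-?\d+)\)", choice_line)
    return {name: int(quality) for name, quality in pairs}


def best_timecounter(choices):
    """Return the name with the highest quality value."""
    return max(choices, key=choices.get)


def wrap_period_seconds(mask, frequency_hz):
    """Seconds before a counter with this mask and frequency rolls over
    (assumption: it wraps after mask + 1 ticks)."""
    return (mask + 1) / frequency_hz


if __name__ == "__main__":
    choices = parse_choices(CHOICE)
    print(best_timecounter(choices))  # ACPI-fast
    # mask and frequency values taken from the sysctl output above
    print(wrap_period_seconds(65535, 1193182))     # i8254
    print(wrap_period_seconds(16777215, 3579545))  # ACPI-fast
```

On a system where only i8254 is listed, as in the original report, the
same parsing would return i8254 as the only candidate.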

Thirdly, here are our BIOS settings (using BIOS 1.30, which is referred
to as "R 1.3a" on Supermicro's site):

--------------------
Supermicro SuperServer 5015B-MT BIOS Settings
============================================
Current BIOS: 1.30
============================================
Reset to Factory Defaults, then change:

* Main
    * Date
         --> Set to GMT, not local time!
    * Serial ATA
         --> Native Mode Operation --> Serial ATA
         --> SATA AHCI Enable      --> Enabled

* Advanced
    * Boot Features
         --> Quiet Boot --> Disabled
    * I/O Device Configuration
         --> Serial port B --> Disabled
         --> Parallel port --> Disabled
    * Console Redirection
         --> Com Port Address         --> On-board COM A
         --> Baud Rate                --> 115.2K
         --> Console Type             --> VT100+
         --> Continue C.R. after POST --> On  (SEE NOTE #2)


NOTE #2: CR after POST
=======================
If the system is running RELENG_7, ***do not*** enable this option.  The
bootloader and thus kernel appear to get confused by who controls the
interrupt, and you end up without *any* serial console output period.

RELENG_8 has addressed this problem, and you *should* enable this feature
when using that OS.  This will allow you to see LAN option ROM messages
during PXE booting, or boot0 (if you use it; usually we don't).
--------------------

Since you have two systems with the same problem, I really don't know
what to tell you.  What I can tell you is that we've run RELENG_7 and
RELENG_8 on all of the following hardware without any problems:

* Supermicro SuperServer 5015B-MTB
  http://www.supermicro.com/products/system/1U/5015/SYS-5015B-MT.cfm
* Supermicro SuperServer 5015M-T+B
  http://www.supermicro.com/products/system/1U/5015/SYS-5015M-T_.cfm
* Supermicro X7SBA
  http://www.supermicro.com/products/motherboard/Xeon3000/3210/X7SBA.cfm
* Supermicro X7SBL-LN2
  http://www.supermicro.com/products/motherboard/Xeon3000/3200/X7SBL-LN2.cfm

Can you provide any tuning you do in loader.conf or sysctl.conf, as well
as your kernel configuration?

Otherwise, if you continue to have problems of this nature, I would
strongly recommend replacing the hardware.  Clock skew of this nature,
at least based on what I've seen at my day/night job, is usually the
sign of a crystal going bad on the motherboard.  Yes, I realise you have
two systems which are exhibiting the same behaviour, but for all I know
a manufacturer (not Supermicro) released a batch of bad crystals into
the market.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

Andriy Gapon

2010-Sep-28 10:07 UTC

cpu timer issues

on 28/09/2010 10:54 Jurgen Weber said the following:
> # dmesg | grep Timecounter
> Timecounter "i8254" frequency 1193182 Hz quality 0
> Timecounters tick every 1.000 msec
> # sysctl kern.timecounter.hardware
> kern.timecounter.hardware: i8254
> 
> Only have one timer to choose from.

Can you provide a little bit more of "hard" data than the above?
Specifically, the following sysctls:
kern.timecounter
dev.cpu

Output of vmstat -i.
_Verbose_ boot dmesg.

Please do not disable ACPI when taking this data.
Preferably, upload it somewhere and post a link to it.
-- 
Andriy Gapon

borislav nikolov

2010-Sep-28 11:02 UTC

cpu timer issues

On 28.09.2010, at 10:54, Jurgen Weber <jurgen@ish.com.au> wrote:
> Hello List
> 
> We have been having issues with some firewall machines of ours using
> pfSense.
> 
> FreeBSD smash01.ish.com.au 7.2-RELEASE-p5 FreeBSD 7.2-RELEASE-p5 #0:
> Sun Dec  6 23:20:31 EST 2009
> sullrich@FreeBSD_7.2_pfSense_1.2.3_snaps.pfsense.org:/usr/obj.pfSense/usr/pfSensesrc/src/sys/pfSense_SMP.7
> i386
> 
> MotherBoard:
> http://www.supermicro.com/products/motherboard/Xeon3000/3200/X7SBi-LN4.cfm
> 
> Originally the systems started out by showing a lot of packet loss,
> the system time would fall behind, and the value of "vmstat -i |
> grep timer" was dropping below 2000. I was led to believe by the
> guys at pfSense that this is where the value should sit. I would
> also receive errors in messages that looked like "kernel: calcru:
> runtime went backwards from 244314 usec to 236341".
> 
> We tried a variety of things: disabling USB, turning off Intel
> SpeedStep in the BIOS, disabling ACPI, etc., all having little to no
> effect. The only thing that would right it was restarting the box,
> but over time it would degrade again. I talked to SuperMicro and
> they said that this is a FreeBSD issue and pretty much washed their
> hands of it.
> 
> After a couple of months of dealing with this and just rebooting the
> systems regularly, the symptoms slowly but surely disappeared, e.g.
> the kernel messages went away, the system time was not falling
> behind and I was experiencing no packet loss, but the "vmstat -i |
> grep timer" value would continue to decrease over time. Eventually,
> I think, when it finally got to 0 the machine restarted (I am only
> guessing here).
> 
> After this restart it worked again for a couple of hours and then it
> restarted again.
> 
> After the second restart the system has not missed a beat; it has
> been fine and the "vmstat -i | grep timer" value remained near the
> 2000 mark. We set up some Zabbix monitoring to watch it. As
> mentioned, it was fine for about a month, until today. Today the
> value has dropped to 0, but the system has not restarted, and over
> the last couple of hours the value has increased to 47.
> 
> This machine is mission critical. We have two in a failover
> scenario (using pfSense's CARP features) and it seems unfortunate
> that we have an issue with two brand new SuperMicro boxes that
> affects both machines. While at the moment everything seems fine, I
> want to ensure that I have no further issues. Does anyone have any
> suggestions?
> 
> Lastly, I have double-checked both of the below:
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#CALCRU-NEGATIVE-RUNTIME
> We disabled EIST.
> 
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#COMPUTER-CLOCK-SKEW
> 
> # dmesg | grep Timecounter
> Timecounter "i8254" frequency 1193182 Hz quality 0
> Timecounters tick every 1.000 msec
> # sysctl kern.timecounter.hardware
> kern.timecounter.hardware: i8254
> 
> Only have one timer to choose from.
> 
> Thanks
> 
> Jurgen

Hello,
vmstat -i calculates the interrupt rate as interrupt count / uptime, and
the interrupt count is a 32-bit integer.
With high values of kern.hz it will overflow in a few days (with
kern.hz=4000 it will happen every 12 days or so).
If that is the case, use systat -vmstat 1 to get an accurate interrupt rate.
This is just FYI, because I was confused by this once and it scared me a
bit, and I started changing counters until I noticed it.

p.s. please forgive my poor english
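
The wraparound described in this message can be checked with a little
arithmetic. The following is a minimal Python sketch, not the actual
vmstat code: it assumes a 32-bit interrupt counter, a perfectly constant
interrupt rate, and a reported rate computed as the wrapped counter
divided by uptime. The function names are hypothetical.

```python
# Illustrative model only: assumes a 32-bit counter and a constant rate.
COUNTER_BITS = 32
WRAP = 1 << COUNTER_BITS  # 4294967296
SECONDS_PER_DAY = 86400


def days_until_wrap(interrupts_per_second):
    """Days of uptime before a 32-bit interrupt counter overflows."""
    return WRAP / interrupts_per_second / SECONDS_PER_DAY


def apparent_rate(interrupts_per_second, uptime_seconds):
    """Rate as it would appear if computed from the wrapped total / uptime."""
    true_total = interrupts_per_second * uptime_seconds
    wrapped_total = true_total % WRAP  # what a 32-bit counter would hold
    return wrapped_total // uptime_seconds


if __name__ == "__main__":
    # kern.hz=4000 example from the message: roughly 12 days
    print(round(days_until_wrap(4000), 1))
    # a steady rate of about 2000 interrupts per second
    print(round(days_until_wrap(2000), 1))
    # apparent rate before and after the counter wraps
    for days in (1, 24, 25, 26):
        print(days, apparent_rate(2000, days * SECONDS_PER_DAY))
```

Under these assumptions a steady 2000 interrupts per second would wrap
the counter after just under 25 days, after which the apparent rate
falls to near zero and climbs back only slowly. That would be consistent
with a value that sat near 2000 for about a month, dropped to 0, and then
crept up to 47, although the sketch alone does not prove that this is
what happened on the machines in question.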

Jurgen Weber

2010-Sep-29 03:49 UTC

cpu timer issues

Andriy

You can find everything you are after here:

http://pastebin.com/WH4V2W0F

Thanks

Jurgen

On 28/09/10 8:07 PM, Andriy Gapon wrote:
> on 28/09/2010 10:54 Jurgen Weber said the following:
>> # dmesg | grep Timecounter
>> Timecounter "i8254" frequency 1193182 Hz quality 0
>> Timecounters tick every 1.000 msec
>> # sysctl kern.timecounter.hardware
>> kern.timecounter.hardware: i8254
>>
>> Only have one timer to choose from.
>
> Can you provide a little bit more of "hard" data than the above?
> Specifically, the following sysctls:
> kern.timecounter
> dev.cpu
>
> Output of vmstat -i.
> _Verbose_ boot dmesg.
>
> Please do not disable ACPI when taking this data.
> Preferably, upload it somewhere and post a link to it.
-- 
-------------------------->
ish
http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001   fax +61 2 9550 4001

Jurgen Weber

2010-Sep-29 04:05 UTC

cpu timer issues

Interesting, using systat everything looks fine. The interrupts hang 
around 2000.

Thanks

Jurgen
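
One plausible reason an interval-based display keeps showing about 2000
while a since-boot average does not: a rate taken from the difference
between two samples is unaffected by a counter wrap, provided the
subtraction is done modulo the counter width. The following is a minimal
Python sketch of that idea, assuming a 32-bit counter; it is not taken
from systat, and the function name is hypothetical.

```python
# Illustrative only: assumes a 32-bit counter sampled at two points in time.
WRAP = 1 << 32


def interval_rate(count_before, count_after, interval_seconds):
    """Interrupts per second over one sampling interval.

    The modulo keeps the difference correct even if the counter
    wrapped (at most once) between the two samples.
    """
    delta = (count_after - count_before) % WRAP
    return delta / interval_seconds


if __name__ == "__main__":
    # no wrap between samples
    print(interval_rate(1_000_000, 1_002_000, 1))
    # counter wrapped between samples: still 2000 per second
    print(interval_rate(WRAP - 500, 1500, 1))
```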

On 28/09/10 8:33 PM, borislav nikolov wrote:
> Hello,
> vmstat -i calculates the interrupt rate as interrupt count / uptime, and
> the interrupt count is a 32-bit integer.
> With high values of kern.hz it will overflow in a few days (with
> kern.hz=4000 it will happen every 12 days or so).
> If that is the case, use systat -vmstat 1 to get an accurate interrupt rate.
> This is just FYI, because I was confused by this once and it scared me a
> bit, and I started changing counters until I noticed it.
>
> p.s. please forgive my poor english
-- 
-------------------------->
ish
http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001   fax +61 2 9550 4001

Jeremy Chadwick

2010-Sep-29 07:29 UTC

cpu timer issues

On Wed, Sep 29, 2010 at 01:49:39PM +1000, Jurgen Weber wrote:
> Andriy
> 
> You can find everything you are after here:
> 
> http://pastebin.com/WH4V2W0F

The information provided here shows ACPI is disabled in addition to the
boot not being verbose.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

Jurgen Weber

2010-Sep-29 21:51 UTC

cpu timer issues

I do not understand what you mean by a verbose dmesg. Looking at the
man page, there is no verbose option for dmesg other than the one I
already used (dmesg -a).

Once that is clarified I can reboot the backup machine and turn on ACPI 
for you.

On 29/09/10 5:29 PM, Jeremy Chadwick wrote:
> On Wed, Sep 29, 2010 at 01:49:39PM +1000, Jurgen Weber wrote:
>> Andriy
>>
>> You can find everything you are after here:
>>
>> http://pastebin.com/WH4V2W0F
>
> The information provided here shows ACPI is disabled in addition to the
> boot not being verbose.
>
-- 
-------------------------->
ish
http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001   fax +61 2 9550 4001
