Dear Sirs.

First, please do not reply to this address; your reply will never reach
me. Please contact me at ohartman@web.de. I cannot post to this
newsgroup via web.de because several web.de hosts are excluded as SPAM
sources.

As I have reported very often in the past, I still have massive problems
with SMP enabled on a FreeBSD 5.3-RELEASE-p1 __and__ FreeBSD 5.3-STABLE
box. The crash is always of the same type: I can 'watch' how the machine
freezes, and in some lucky moments I am able to switch to the console
before the box dies completely and see what error message comes up.

This machine has an ASUS CUR-DLS mainboard with the RCC ServerWorks
chipset, version 3, for Pentium 3 CPUs. At the moment I use two Intel
1 GHz CPUs of the same stepping; before this error report I used two
866 MHz CPUs of different steppings, but it seems to make no difference.

I also tried a lot of kernel options, especially those which are
supposed to be critical (that is, I switched them off), and I used a
GENERIC kernel for a while, but it makes no difference. The crash occurs
while using a graphical console: Xorg X11 (version 4.7.0 as compiled
from the ports) with fvwm2 (development version, but the crash also
occurs with windowmaker, so the GUI seems not to be the issue). I also
tried to fix the problem by using the built-in fxp NIC instead of the
64-bit Intel GBit LAN adapter (em0), but it is always the same.

I will append the output of mptable -verbose -dmesg for your
information, and I will add the error message I receive. Most times when
the crash occurred I was generating a lot of graphical load (working on
several TIFF files 200 MB in size, or with Mozilla/Firefox), but this
may simply trigger the problem or make it appear sooner.

Sometimes I cannot get 'systat -vmstat 1' output; calling vmstat in
systat results in 'Alternate system clock has died. Reverting to
''pigs'' ...'. This happens very often with SMP, but not with UP.

I will add that the UP system (SMP disabled by kern.smp.disable='1' in
loader.conf) was up for nearly 13 days under the same conditions under
which the SMP box crashes after several minutes to several hours.

This is the last console error I received:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 00
fault virtual address = 0x1c
fault code = supervisor write, page not present
instruction pointer = 0x8:0xc062ac76
stack pointer = 0x10:0x4e2d7ac
frame pointer = 0x10:0xe4e2d7c4
code segment = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 44 (swi5: clock sio)
[thread 100042]
Stopped at vref +0x16: lock cmpxchgl %edx, 0x1c(%edx)

I am not a technical expert nor a kernel programmer. I tried to figure
out which function was executing at that address via the recommended
nm -n kernel|grep c062ac76, and it results in 'T vref'.

What is 'swi5: clock sio'? Is this problem hardware related? Why only in
SMP? Others seem not to have problems with 5.3 and SMP; maybe this is
very specific to me due to the RCC-based mainboard I use (in the past I
had a lot of problems with a TYAN 2500 mobo, also based on a ServerWorks
chipset, in conjunction with FreeBSD 4/5).

This is my mptable-output:

==============================================================================

MPTable, version 2.0.15

 looking for EBDA pointer @ 0x040e, found, searching EBDA @ 0x0009f000
 searching CMOS 'top of mem' @ 0x0009ec00 (635K)
 searching default 'top of mem' @ 0x0009fc00 (639K)
 searching BIOS @ 0x000f0000

 MP FPS found in BIOS @ physical addr: 0x000f5270

-------------------------------------------------------------------------------

MP Floating Pointer Structure:

  location:                     BIOS
  physical address:             0x000f5270
  signature:                    '_MP_'
  length:                       16 bytes
  version:                      1.4
  checksum:                     0xe3
  mode:                         Virtual Wire

-------------------------------------------------------------------------------

MP Config Table Header:

  physical address:             0x000f4e60
  signature:                    'PCMP'
  base table length:            276
  version:                      1.4
  checksum:                     0x0d
  OEM ID:                       'OEM00000'
  Product ID:                   'PROD00000000'
  OEM table pointer:            0x00000000
  OEM table size:               0
  entry count:                  26
  local APIC address:           0xfee00000
  extended table length:        124
  extended table checksum:      198

-------------------------------------------------------------------------------

MP Config Base Table Entries:

--
Processors:     APIC ID  Version  State        Family  Model  Step  Flags
                3        0x11     BSP, usable  6       8      6     0x387fbff
                0        0x11     AP, usable   6       8      6     0x387fbff
--
Bus:            Bus ID  Type
                0       PCI
                1       PCI
                2       ISA
--
I/O APICs:      APIC ID  Version  State   Address
                2        0x11     usable  0xfec00000
                3        0x11     usable  0xfec01000
--
I/O Ints:       Type    Polarity   Trigger   Bus ID  IRQ   APIC ID  PIN#
                ExtINT  conforms   conforms  2       0     2        0
                INT     conforms   conforms  2       1     2        1
                INT     conforms   conforms  2       0     2        2
                INT     conforms   conforms  2       3     2        3
                INT     conforms   conforms  2       4     2        4
                INT     conforms   conforms  2       6     2        6
                INT     conforms   conforms  2       7     2        7
                INT     conforms   conforms  2       8     2        8
                INT     conforms   conforms  2       12    2        12
                INT     conforms   conforms  2       13    2        13
                INT     conforms   conforms  2       14    2        14
                INT     conforms   conforms  2       15    2        15
                INT     active-lo  level     0       15:A  3        14
                INT     active-lo  level     2       9     2        9
                INT     active-lo  level     1       3:A   3        6
                INT     active-lo  level     1       5:A   3        8
                INT     active-lo  level     1       5:B   3        9
--
Local Ints:     Type    Polarity   Trigger   Bus ID  IRQ   APIC ID  PIN#
                ExtINT  active-hi  edge      2       0     255      0
                NMI     active-hi  edge      2       0     255      1

-------------------------------------------------------------------------------

MP Config Extended Table Entries:

--
System Address Space
 bus ID: 0 address type: I/O address
 address base: 0x0
 address range: 0x10000
--
System Address Space
 bus ID: 0 address type: memory address
 address base: 0x40000000
 address range: 0xbebe0000
--
System Address Space
 bus ID: 0 address type: prefetch address
 address base: 0xfebe0000
 address range: 0xe9420000
--
System Address Space
 bus ID: 0 address type: memory address
 address base: 0xe8000000
 address range: 0x18000000
--
System Address Space
 bus ID: 0 address type: memory address
 address base: 0xa0000
 address range: 0x20000
--
Bus Heirarchy
 bus ID: 2 bus info: 0x01 parent bus ID: 0
--
Compatibility Bus Address
 bus ID: 0 address modifier: add
 predefined range: 0x00000000
--
Compatibility Bus Address
 bus ID: 0 address modifier: add
 predefined range: 0x00000001

-------------------------------------------------------------------------------

dmesg output:

WARNING: /compat was not properly dismounted
WARNING: /homes was not properly dismounted
WARNING: /usr was not properly dismounted
WARNING: /usr/data was not properly dismounted
WARNING: /usr/local was not properly dismounted
WARNING: /usr/obj was not properly dismounted
/usr/obj: mount pending error: blocks 21296 files 928
/usr/obj: superblock summary recomputed
WARNING: /usr/scratch was not properly dismounted
WARNING: /usr/src was not properly dismounted
WARNING: /var was not properly dismounted
pflog0: promiscuous mode enabled
em0: Link is up 100 Mbps Full Duplex
em0: promiscuous mode enabled
em0: promiscuous mode disabled

===================================================================
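
Editorial note on the symbol lookup in the report above: searching an nm -n listing for the exact instruction pointer only matches when a symbol starts precisely at that address. The more general approach is to find the last symbol at or below the address and report the offset. The following is a minimal sketch of that idea, not a FreeBSD tool. The vref address c062ac60 is derived from the panic message (0xc062ac76 minus 0x16); the two neighbouring symbols and the function names parse_nm and resolve are hypothetical, made up for illustration.

```python
import bisect


def parse_nm(text):
    """Parse `nm -n` style lines ("address type name") into a sorted list."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and undefined symbols (no address)
        addr, _kind, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) for the last symbol at or below `address`."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None  # address lies below every known symbol
    base, name = symbols[i]
    return name, address - base


# Sample listing: only the vref line is derived from the panic message;
# the other two entries are hypothetical placeholders.
SAMPLE = """\
c062ab10 T hypothetical_before
c062ac60 T vref
c062acb0 T hypothetical_after
"""

symbols = parse_nm(SAMPLE)
name, offset = resolve(symbols, 0xC062AC76)
print(f"{name}+{offset:#x}")
```

With a real kernel one would feed the actual output of nm -n into parse_nm instead of the sample text.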
Pruning -smp crosspost since I'm not on that list. To: address updated
accordingly.

On Sat, 20 Nov 2004, Oliver Hartmann wrote:

> First, please do not reply on this address, your reply will never reach
> me. Please contact me at ohartman@web.de. I can not post into this
> newsgroup via web.de due to SPAM exclusion of several web.de hosts.
>
> As I reported very often in the past I have still massive problems with
> SMP enabled on a FreeBSD 5.3-RELEASE-p1 __and__ FreeBSD 5.3-STABLE box.
> The crash is always of the same type as I can 'watch' how the machine
> freezes and for some lucky moments I am able to switch to the console
> before the box dies definitely and watch what error message comes up.

The panic caught below appears to have dropped you into ddb. Could you
run 'tr' and post the output along with the panic output next time you
trigger this?

> This machine is a ASUS CUR-DLS mainboard, utilizing the RCC ServerWorks
> chipset, version 3 for Pentium 3 CPUs. At this moment I use two Intel
> 1GHz CPUs of the same stepping, but prior to this error report I used
> two CPUs with 866 MHz and of different steppings, but it seems to make
> no difference.
>
> I also tried a lot of kernel options, especially those which are
> supposed to be critical (means: I switched them off) and I used a
> GENERIC kernel for a while, but it makes no difference. The crash occurs
> while using a graphical console, Xorg X11 (version 4.7.0 as compiled
> from the ports), fvwm2 (development version, but crash occurs also with
> windowmaker so the GUI seems not to be an issue). I also tried to fix
> the problem by using built in fxp-NIC instead of the 64Bit Intel GBit
> LAN adapter (em0), but it is always the same.

What are you using for disk? Are you using the built-in ATA controller?

> I will append a mptable -verbose -dmesg output for your information and
> I will add the error message I receive. Most time when the crash occurs
> I did a lot of graphical load (working on several TIFF files 200MB in
> size or with Mozilla/Firefox), but this may simply trigger or fasten up
> the problem.

Are these operations compute-intensive or I/O-intensive, i.e. CPU bound
or I/O bound?

> Sometimes I can not get a 'systat -vmstat 1' output, calling vmstat in
> systat results in 'Alternate system clock has died. Reverting to
> ''pigs'' ...'. This happens very often in SMP, but not in UP.

That's not good. That may indicate interrupt routing problems, and ASUS
is traditionally bad at writing ACPI code. You may try disabling ACPI if
you haven't already.

> I will add, that the UP system (SMP disabled by kern.smp.disable='1' in
> loader.conf) was up for nearly 13 days under same conditions when a SMP
> box crashes after several minutes, several hours.

Good to know.

> This is the last console error I received:
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 1; apic id = 00
> fault virtual address = 0x1c
> fault code = supervisor write, page not present
> instruction pointer = 0x8:0xc062ac76
> stack pointer = 0x10:0x4e2d7ac
> frame pointer = 0x10:0xe4e2d7c4
> code segment = base 0x0, limit 0xfffff, type 0x1b
>              = DPL 0, pres 1, def32 1, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process = 44 (swi5: clock sio)
> [thread 100042]
> Stopped at vref +0x16: lock cmpxchgl %edx, 0x1c(%edx)

Hm, null vnode reference. vref() just increments the usecount on a
vnode, but it's surrounded by mutex operations on that vnode which use
that particular instruction. Considering that the releases in question
are not known to have these types of problems, I'd say we're looking at
a hardware problem.

> What is 'swi5: clock sio'? Is this problem hardware related? Why only in
> SMP? Others seem not to have problems with 5.3 and SMP, maybe this is
> very specific to me due to the RCC based mainboard I use (in the past I
> had a lot of problems with a TYAN 2500 mobo also based on ServerWorks
> chipset in conjunction with FreeBSD 4/5).

I've run Linux on that series of Tyan board (2510 and the later 2518)
with only one problem -- the onboard ATA controller is known to cause
data corruption and should not be used under any circumstances. The 2518
ships with an onboard Promise that works.

If you are using the onboard ATA controller I strongly suggest using
some other disk interface. A temporary workaround would be to turn off
DMA mode on your disks by adding this to /boot/loader.conf and
rebooting:

hw.ata.ata_dma="0"

This will of course cause a huge performance impact, particularly if
your work is I/O bound.

I'd also check for the usual hardware suspects -- cooling problems
(insufficient heatsinks, broken fans, poorly designed airflow, etc.),
overclocking, bad or incorrect memory, bad processor, bad motherboard.

-- 
Doug White                    |  FreeBSD: The Power to Serve
dwhite@gumbysoft.com          |  www.FreeBSD.org
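
Editorial note on the "null vnode reference" reading in the reply: the faulting instruction writes to the memory operand 0x1c(%edx), and the reported fault virtual address is 0x1c. Subtracting the displacement from the fault address gives the value the base register must have held. The sketch below works through that arithmetic. It assumes a simple AT&T-style displacement(base) operand with no index or scale, and the function name base_from_fault is hypothetical.

```python
import re


def base_from_fault(operand, fault_address):
    """Given an operand such as '0x1c(%edx)' and the faulting address,
    return (register, value the register must have held)."""
    m = re.fullmatch(r"(0x[0-9a-fA-F]+|\d+)?\((%\w+)\)", operand.strip())
    if m is None:
        raise ValueError(f"unsupported operand form: {operand!r}")
    disp_text, register = m.group(1), m.group(2)
    if disp_text is None:
        disp = 0  # no displacement written, e.g. '(%edx)'
    elif disp_text.lower().startswith("0x"):
        disp = int(disp_text, 16)
    else:
        disp = int(disp_text, 10)
    # effective address = base + displacement, so base = address - displacement
    return register, fault_address - disp


register, value = base_from_fault("0x1c(%edx)", 0x1C)
print(register, hex(value))
```

A base register value of zero is consistent with a null pointer being dereferenced at a field offset of 0x1c, which matches the interpretation given in the reply.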