Karl Denninger
2005-Mar-29 18:14 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE
WARNING!

FreeBSD 5.4-PRERELEASE #6: Tue Mar 29 12:44:22 CST 2005 karl@FS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.4-PRERELEASE #6: Tue Mar 29 12:44:22 CST 2005
karl@FS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP
ACPI APIC Table: <DELL PE400SC>
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) 4 CPU 2.40GHz (2394.01-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0xf29 Stepping = 9
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Hyperthreading: 2 logical CPUs
real memory = 267862016 (255 MB)
avail memory = 252456960 (240 MB)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
ioapic0: Changing APIC ID to 2
ioapic0 <Version 2.0> irqs 0-23 on motherboard
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <DELL PE400SC> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
agp0: <Intel 82875P host to AGP bridge> mem 0xe8000000-0xefffffff at device 0.0 on pci0
pcib1: <PCI-PCI bridge> at device 1.0 on pci0
pci1: <PCI bus> on pcib1
pci1: <display, VGA> at device 0.0 (no driver attached)
uhci0: <Intel 82801EB (ICH5) USB controller USB-A> port 0xff80-0xff9f irq 16 at device 29.0 on pci0
usb0: <Intel 82801EB (ICH5) USB controller USB-A> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <Intel 82801EB (ICH5) USB controller USB-B> port 0xff60-0xff7f irq 19 at device 29.1 on pci0
usb1: <Intel 82801EB (ICH5) USB controller USB-B> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <Intel 82801EB (ICH5) USB controller USB-C> port 0xff40-0xff5f irq 18 at device 29.2 on pci0
usb2: <Intel 82801EB (ICH5) USB controller USB-C> on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
uhci3: <Intel 82801EB (ICH5) USB controller USB-D> port 0xff20-0xff3f irq 16 at device 29.3 on pci0
usb3: <Intel 82801EB (ICH5) USB controller USB-D> on uhci3
usb3: USB revision 1.0
uhub3: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub3: 2 ports with 2 removable, self powered
pci0: <serial bus, USB> at device 29.7 (no driver attached)
pcib2: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci2: <ACPI PCI bus> on pcib2
atapci0: <SiI 3112 SATA150 controller> port 0xcd70-0xcd7f,0xcd5c-0xcd5f,0xcd68-0xcd6f,0xcd58-0xcd5b,0xcd60-0xcd67 mem 0xfe7dee00-0xfe7defff irq 21 at device 0.0 on pci2
ata2: channel #0 on atapci0
ata3: channel #1 on atapci0
ahc0: <Adaptec 2940 Ultra SCSI adapter> port 0xce00-0xceff mem 0xfe7df000-0xfe7dffff irq 22 at device 1.0 on pci2
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
rp0: <RocketPort PCI> port 0xcd80-0xcdbf irq 17 at device 2.0 on pci2
RocketPort0 (Version 3.02) 4 ports.
pcib3: <PCI-PCI bridge> at device 3.0 on pci2
pci3: <PCI bus> on pcib3
fxp0: <Intel 82558 Pro/100 Ethernet> port 0xbf80-0xbf9f mem 0xfe400000-0xfe4fffff,0xf8001000-0xf8001fff irq 19 at device 4.0 on pci3
miibus0: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:d0:b7:6f:ce:e8
fxp1: <Intel 82558 Pro/100 Ethernet> port 0xbfe0-0xbfff mem 0xfe500000-0xfe5fffff,0xf8000000-0xf8000fff irq 18 at device 5.0 on pci3
miibus1: <MII bus> on fxp1
inphy1: <i82555 10/100 media interface> on miibus1
inphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp1: Ethernet address: 00:d0:b7:6f:ce:e9
em0: <Intel(R) PRO/1000 Network Connection, Version - 1.7.35> port 0xcdc0-0xcdff mem 0xfe7e0000-0xfe7fffff irq 18 at device 12.0 on pci2
em0: Ethernet address: 00:0c:f1:c9:df:c5
em0: Speed:N/A Duplex:N/A
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <Intel ICH5 UDMA100 controller> port 0xffa0-0xffaf,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 irq 18 at device 31.1 on pci0
ata0: channel #0 on atapci1
ata1: channel #1 on atapci1
atapci2: <Intel ICH5 SATA150 controller> port 0xfea0-0xfeaf,0xfe30-0xfe33,0xfe20-0xfe27,0xfe10-0xfe13,0xfe00-0xfe07 irq 18 at device 31.2 on pci0
ata4: channel #0 on atapci2
ata5: channel #1 on atapci2
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
pci0: <multimedia, audio> at device 31.5 (no driver attached)
fdc0: <floppy drive controller> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
orm0: <ISA Option ROMs> at iomem 0xd1800-0xd3fff,0xd0000-0xd17ff,0xcf800-0xcffff,0xcb000-0xcf7ff,0xc0000-0xcafff on isa0
pmtimer0 on isa0
ppc0: parallel port not found.
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
RTC BIOS diagnostic error 18<memory_size,fixed_disk>
Timecounters tick every 10.000 msec
ipfw2 initialized, divert enabled, rule-based forwarding disabled, default to deny, logging disabled
acd0: CDROM <Lite-On LTN486S 48x Max/YDS6> at ata1-master UDMA33
ad4: 238475MB <HDS722525VLSA80/V36OA63A> [484521/16/63] at ata2-master SATA150
ad6: 239372MB <Maxtor 6B250S0/BANC1B70> [486344/16/63] at ata3-master SATA150
em0: Link is up 100 Mbps Full Duplex
ad8: 239372MB <Maxtor 6B250S0/BANC1980> [486344/16/63] at ata4-master SATA150
ad10: 238475MB <HDS722525VLSA80/V36OA63A> [484521/16/63] at ata5-master SATA150
Waiting 15 seconds for SCSI devices to settle
GEOM_MIRROR: Device boot created (id=1131801609).
GEOM_MIRROR: Device boot: provider ad8s1 detected.
GEOM_MIRROR: Device boot: provider ad10s1 detected.
GEOM_MIRROR: Device boot: provider ad10s1 activated.
GEOM_MIRROR: Device boot: provider mirror/boot launched.
GEOM_MIRROR: Device boot: rebuilding provider ad8s1.
sa0 at ahc0 bus 0 target 2 lun 0
sa0: <DEC DLT2000 15/30 GB 840B> Removable Sequential Access SCSI-2 device
sa0: 5.000MB/s transfers (5.000MHz, offset 15)
SMP: AP CPU #1 Launched!
Mounting root from ufs:/dev/mirror/boota
WARNING: /dbms was not properly dismounted
WARNING: /disk was not properly dismounted
WARNING: /archive was not properly dismounted
em0: Link is up 100 Mbps Full Duplex

Built this afternoon.

This has a fix in the ATA code, to wit:

mdodd 2005-03-23 04:50:26 UTC

FreeBSD src repository

Modified files: (Branch: RELENG_5)
sys/dev/ata ata-queue.c
Log:
MFC
1.42: When resubmitting a timed out request, reset donecount.
1.41: Reset timeout when we are back from interrupt.
1.40: Correct logical error, result was that retries wasn't always made but failure reported instead.
1.39: Do not retry on requests that have lost their device during reinit.

Approved by: re

Revision Changes Path
1.32.2.6 +10 -3 src/sys/dev/ata/ata-queue.c

This change is EXTREMELY DANGEROUS.

If your system takes a RECOVERABLE DMA Write Error (as is happening frequently on machines with SATA drives on the PCI bus!) you used to get a drive disconnect from the mirror.

NOW you end up with a RADICALLY unstable machine. Specifically, interrupts now get seriously screwed up, serial I/O on the machine stops working immediately, and ultimately the machine crashes or hangs in VERY odd ways.

This change needs to be backed out immediately until it can be determined why a requeued request destabilizes the system.

I have removed the ad4 and ad6 drives here (which are the ones on the PCI bus) until this is addressed.

--
--
Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net My home on the net - links to everything I do!
http://scubaforum.org Your UNCENSORED place to talk about DIVING!
http://www.spamcuda.net SPAM FREE mailboxes - FREE FOR A LIMITED TIME!
http://genesis3.blogspot.com Musings Of A Sentient Mind
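
[Editorial illustration of the 1.42 log entry quoted in the message above.] The following is a minimal user-space sketch of what "when resubmitting a timed out request, reset donecount" could mean. It is not the code from ata-queue.c, which is not shown in this thread. The struct io_request and the io_* functions are hypothetical names made up for the example, and the assumption is that a retried request starts its transfer over from the beginning.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an I/O request; NOT the real struct ata_request. */
struct io_request {
    size_t bytecount;  /* total bytes the caller asked for */
    size_t donecount;  /* bytes transferred so far in this attempt */
    int    retries;    /* retries still allowed */
    bool   timed_out;  /* set when the attempt stalls */
};

/* Simulate an attempt that moves `moved` bytes and then times out. */
void io_partial_transfer(struct io_request *req, size_t moved)
{
    req->donecount += moved;
    req->timed_out = true;
}

/*
 * Prepare a timed-out request for another attempt.
 * Returns true if the request may be requeued, false otherwise.
 */
bool io_resubmit(struct io_request *req)
{
    if (!req->timed_out || req->retries <= 0)
        return false;
    req->retries--;
    req->timed_out = false;
    /* Assumption: the retry restarts the whole transfer, so the
     * progress counter from the failed attempt must not carry over. */
    req->donecount = 0;
    return true;
}

/* Simulate an attempt that moves `moved` bytes; true when the request is complete. */
bool io_complete(struct io_request *req, size_t moved)
{
    req->donecount += moved;
    return req->donecount == req->bytecount;
}
```

Without the reset, a retry that transfers the full 4096 bytes after a stalled 1024-byte attempt would leave donecount at 5120, which no longer matches bytecount.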
Matthew N. Dodd
2005-Mar-29 20:40 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE
On Tue, 29 Mar 2005, Karl Denninger wrote:
> 1.42: When resubmitting a timed out request, reset donecount.
> 1.41: Reset timeout when we are back from interrupt.
> 1.40: Correct logical error, result was that retries wasn't always made but
> failure reported instead.
> 1.39: Do not retry on requests that have lost their device during reinit.
>
> This change is EXTREMELY DANGEROUS.
>
> This change needs to be backed out immediately until it can be determined
> why a requeued request destabilizes the system.

The changes in question are very small. Could you attempt to isolate which one is the cause?

Thanks.

--
10 40 80 C0 00 FF FF FF FF C0 00 00 00 00 10 AA AA 03 00 00 00 08 00
Karl Denninger
2005-Mar-31 09:46 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE - UPDATE (real this time)
On Thu, Mar 31, 2005 at 12:19:13PM -0500, Matthew N. Dodd wrote:
> On Thu, 31 Mar 2005, Karl Denninger wrote:
> > What do you expect the patch to do, given that removing the delta
> > appears to fix the instability problem?
>
> I expect the patch to properly stop the callout.

Ok, so the implication is that the callouts are/were being triggered and dumped on the floor without it?

Would you prefer me to validate that path before or after I attempt to validate that removing the first delta fixes the crashes on my production machine?

(Reason for the question is that the latter will require most of the rest of the day and evening, while the former can likely be done by 3-4 pm today)

--
--
Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
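
[Editorial illustration of the callout exchange above.] The patch being discussed is not included in this thread, so the following is only a rough user-space model of the general failure mode: a one-shot timeout that is not stopped when its request completes can later fire against a request that is already finished. All names here (struct timer, struct request, timer_arm, timer_stop, request_done, clock_tick) are hypothetical and are not the FreeBSD callout API.

```c
#include <stdbool.h>

/* Hypothetical one-shot timer; a stand-in for a kernel callout. */
struct timer {
    bool armed;
    int  deadline;  /* tick at which the timeout handler should run */
};

struct request {
    struct timer timeout;
    bool in_flight;          /* true while the hardware owns the request */
    int  spurious_timeouts;  /* timeouts that ran after completion */
};

void timer_arm(struct timer *t, int now, int ticks)
{
    t->armed = true;
    t->deadline = now + ticks;
}

void timer_stop(struct timer *t)
{
    t->armed = false;
}

/* Completion path. stop_timer selects the intended behaviour (true)
 * or the modelled bug (false), where the timer is left armed. */
void request_done(struct request *r, bool stop_timer)
{
    r->in_flight = false;
    if (stop_timer)
        timer_stop(&r->timeout);
}

/* Advance the clock to `now`; run the timeout handler if it is due. */
void clock_tick(struct request *r, int now)
{
    if (r->timeout.armed && now >= r->timeout.deadline) {
        r->timeout.armed = false;
        if (!r->in_flight)
            r->spurious_timeouts++;  /* handler ran on a finished request */
    }
}
```

In this model, completing a request with stop_timer set to false and then ticking past the deadline records one spurious timeout. With stop_timer set to true it records none.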
Matthew N. Dodd
2005-Mar-31 11:20 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE - UPDATE (real this time)
On Thu, 31 Mar 2005, Karl Denninger wrote:
> Would you prefer me to validate that path before or after I attempt to
> validate that removing the first delta fixes the crashes on my
> production machine? (Reason for the question is that the latter will
> require most of the rest of the day and evening, while the former can
> likely be done by 3-4 pm today)

Your previous email suggests that the callout stuff is to blame. Testing the fix seems like a good plan unless you have any doubts about your earlier results.

--
10 40 80 C0 00 FF FF FF FF C0 00 00 00 00 10 AA AA 03 00 00 00 08 00
Karl Denninger
2005-Mar-31 11:57 UTC
DANGER WILL ROBINSON! SERIOUS problem with current 5.4-PRERELEASE - UPDATE (real this time)
On Thu, Mar 31, 2005 at 02:16:11PM -0500, Matthew N. Dodd wrote:
> On Thu, 31 Mar 2005, Karl Denninger wrote:
> > Would you prefer me to validate that path before or after I attempt to
> > validate that removing the first delta fixes the crashes on my
> > production machine? (Reason for the question is that the latter will
> > require most of the rest of the day and evening, while the former can
> > likely be done by 3-4 pm today)
>
> Your previous email suggests that the callout stuff is to blame. Testing
> the fix seems like a good plan unless you have any doubts about your
> earlier results.

I cannot provide solid validation until I test it on the production machine here.

The sandbox survived overnight with dozens of write errors, but whether the production system will is not yet known.

Will test on the production system and advise.

--
--
Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist