Jonathan Stewart
2006-Sep-12 17:39 UTC
Anyone??? (was Reproducible data corruption on 6.1-Stable)
(I know double posting is bad form, but this includes new information and it's been several days. Suggestions on where else to look for help are welcome; HighPoint was no help.)

I set up a new server recently and transferred all the information from my old server over. I tried to use unison to synchronize the backup of pictures I have taken and noticed that a large number of pictures were marked as changed on the server. After checking the pictures by hand I confirmed that many of the pictures on the server were corrupted. I attempted to use unison to update the files on the server with the correct local copies, but it would fail on almost all the files with the message "destination updated during synchronization."

It appears the corruption happens during the read process, because when I recompare the files in a graphical diff tool between cache flushes the differences move around!?!?!? The differences also appear to be very small for the most part: single bytes scattered throughout the file. I really have no idea what is causing the problem and would like to pin it down so I can either replace hardware if it's bad or fix whatever the bug is. I cvsuped and rebuilt world and kernel recently hoping that it had been fixed, but with no luck. I have not seen any error messages on the console at all either.

I have a pair of 320GB SATA hard drives set up as RAID0 on a HighPoint RocketRaid 1520 card. The card BIOS is the latest revision, as is the motherboard BIOS.

This being a data corruption issue, I can afford any amount of downtime needed for troubleshooting, as it's not very useful to have the server up if everything is going to get corrupted.

Thank you,
Jonathan

uname -a:
FreeBSD XXXXX 6.1-STABLE FreeBSD 6.1-STABLE #0: Sun Sep 10 22:54:17 EDT 2006 root@XXXXX:/usr/obj/usr/src/sys/SERVER i386

dmesg:
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California.
All rights reserved.
FreeBSD 6.1-STABLE #0: Sun Sep 10 22:54:17 EDT 2006
root@XXXXX:/usr/obj/usr/src/sys/SERVER
mptable_probe: MP Config Table has bad signature: 4\^C\^_
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) XP 3200+ (2090.16-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x6a0  Stepping = 0
  Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
  AMD Features=0xc0400800<SYSCALL,MMX+,3DNow+,3DNow>
real memory = 1073676288 (1023 MB)
avail memory = 1041698816 (993 MB)
kbd1 at kbdmux0
ath_hal: 0.9.17.2 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
acpi0: <Nvidia AWRDACPI> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
cpu0: <ACPI CPU> on acpi0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
Correcting nForce2 C1 CPU disconnect hangs
agp0: <NVIDIA nForce2 AGP Controller> mem 0xd8000000-0xdbffffff at device 0.0 on pci0
pci0: <memory, RAM> at device 0.1 (no driver attached)
pci0: <memory, RAM> at device 0.2 (no driver attached)
pci0: <memory, RAM> at device 0.3 (no driver attached)
pci0: <memory, RAM> at device 0.4 (no driver attached)
pci0: <memory, RAM> at device 0.5 (no driver attached)
isab0: <PCI-ISA bridge> at device 1.0 on pci0
isa0: <ISA bus> on isab0
pci0: <serial bus, SMBus> at device 1.1 (no driver attached)
ohci0: <OHCI (generic) USB controller> mem 0xe1085000-0xe1085fff irq 5 at device 2.0 on pci0
ohci0: [GIANT-LOCKED]
usb0: OHCI version 1.0, legacy support
usb0: <OHCI (generic) USB controller> on ohci0
usb0: USB revision 1.0
uhub0: nVidia OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 3 ports with 3 removable, self powered
ohci1: <OHCI (generic) USB controller> mem 0xe1082000-0xe1082fff irq 5 at device 2.1 on pci0
ohci1: [GIANT-LOCKED]
usb1: OHCI version 1.0, legacy support
usb1: <OHCI (generic) USB controller> on ohci1
usb1: USB revision 1.0
uhub1: nVidia OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 3 ports with 3 removable, self powered
ehci0: <NVIDIA nForce2 USB 2.0 controller> mem 0xe1083000-0xe10830ff irq 12 at device 2.2 on pci0
ehci0: [GIANT-LOCKED]
usb2: EHCI version 1.0
usb2: companion controllers, 4 ports each: usb0 usb1
usb2: <NVIDIA nForce2 USB 2.0 controller> on ehci0
usb2: USB revision 2.0
uhub2: nVidia EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub2: 6 ports with 6 removable, self powered
nve0: <NVIDIA nForce MCP2 Networking Adapter> port 0xe400-0xe407 mem 0xe1084000-0xe1084fff irq 12 at device 4.0 on pci0
nve0: Ethernet address 00:0c:6e:7d:e0:79
miibus0: <MII bus> on nve0
rlphy0: <RTL8201L 10/100 media interface> on miibus0
rlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
nve0: Ethernet address: 00:0c:6e:7d:e0:79
pci0: <multimedia, audio> at device 5.0 (no driver attached)
pci0: <multimedia, audio> at device 6.0 (no driver attached)
pcib1: <ACPI PCI-PCI bridge> at device 8.0 on pci0
pci1: <ACPI PCI bus> on pcib1
atapci0: <HighPoint HPT372N UDMA133 controller> port 0xa000-0xa007,0xa400-0xa403,0xa800-0xa807,0xac00-0xac03,0xb000-0xb0ff irq 11 at device 6.0 on pci1
ata2: <ATA channel 0> on atapci0
ata3: <ATA channel 1> on atapci0
pci1: <multimedia, audio> at device 9.0 (no driver attached)
pci1: <input device> at device 9.1 (no driver attached)
atapci1: <nVidia nForce2 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 9.0 on pci0
ata0: <ATA channel 0> on atapci1
ata1: <ATA channel 1> on atapci1
pcib2: <ACPI PCI-PCI bridge> at device 12.0 on pci0
pci2: <ACPI PCI bus> on pcib2
xl0: <3Com 3c920B-EMB Integrated Fast Etherlink XL> port 0xc000-0xc07f mem 0xdd000000-0xdd00007f irq 5 at device 1.0 on pci2
miibus1: <MII bus> on xl0
acphy0: <AC101L 10/100 media interface> on miibus1
acphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl0: Ethernet address: 00:26:54:10:8c:0f
pcib3: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci3: <ACPI PCI bus> on pcib3
pci3: <display, VGA> at device 0.0 (no driver attached)
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
ppc0: <ECP parallel printer port> port 0x378-0x37f,0x778-0x77b irq 7 drq 3 on acpi0
ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/16 bytes threshold
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
pmtimer0 on isa0
orm0: <ISA Option ROMs> at iomem 0xd0000-0xd17ff,0xd6000-0xd67ff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounter "TSC" frequency 2090164914 Hz quality 800
Timecounters tick every 1.000 msec
ad0: 194481MB <Maxtor 6L200R0 BAH41G10> at ata0-master UDMA133
acd0: DVDROM <CREATIVE DVD-ROM DVD1241E/VER 0.44> at ata0-slave UDMA33
ad4: 305245MB <Seagate ST3320620AS 3.AAC> at ata2-master UDMA133
ad6: 305245MB <Seagate ST3320620AS 3.AAC> at ata3-master UDMA133
ar0: 610490MB <HighPoint v2 RocketRAID RAID0 (stripe 16 KB)> status: READY
ar0: disk0 READY using ad4 at ata2-master
ar0: disk1 READY using ad6 at ata3-master
Trying to mount root from ufs:/dev/ad0s1a
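
The observation that the single-byte differences move around between reads can be checked without a graphical diff tool. The following is only a minimal sketch in Python, assuming a known-good copy of a file is available next to the suspect copy; the function name `diff_offsets` and the command-line usage are hypothetical and not part of the original post.

```python
import sys
from typing import List, Tuple


def diff_offsets(path_a: str, path_b: str,
                 chunk_size: int = 1 << 20) -> List[Tuple[int, int, int]]:
    """Return (offset, byte_in_a, byte_in_b) for each differing byte.

    Only the common prefix of the two files is compared; a difference in
    length is not reported by this sketch.
    """
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            if a != b:
                # Walk the chunk only when it is known to differ.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append((offset + i, x, y))
            offset += max(len(a), len(b))
    return diffs


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical): python diff_offsets.py good_copy suspect_copy
    for off, x, y in diff_offsets(sys.argv[1], sys.argv[2]):
        print(f"offset {off}: {x:#04x} != {y:#04x}")
```

Running it twice on the same pair of files, with a cache flush in between, and comparing the two lists of offsets would show whether the differences are stable (which would suggest the data on disk is damaged) or move around (which would point at the read path, RAM, or controller instead).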
George Hartzell
2006-Sep-12 18:07 UTC
Anyone??? (was Reproducible data corruption on 6.1-Stable)
Jonathan Stewart writes:
> [...]
> I set up a new server recently and transferred all the information from
> my old server over. I tried to use unison to synchronize the backup of
> pictures I have taken and noticed that a large number of pictures where
> marked as changed on the server. After checking the pictures by hand I
> confirmed that many of the pictures on the server were corrupted. I
> attempted to use unison to update the files on the server with the
> correct local copies but it would fail on almost all the files with the
> message "destination updated during synchronization."
>
> It appears the corruption happens during the read process because when I
> recompare the files in a graphical diff tool between cache flushes the
> differences move around!?!?!? The differences also appear to be very
> small for the most part, single bytes scattered throughout the file. I
> really have no idea what is causing the problem and would like to pin it
> down so I can either replace hardware if it's bad or fix whatever the
> bug is.
> [...]

It might be a memory problem.

I had a Linux server that was serving a subversion repository, plus some web stuff. I added some additional memory to keep it from wheezing and it seemed to be running fine. We started noticing problems with things that had been checked out of the repository (e.g. binary tarballs). Removing the extra memory made things work again.

memtest86 didn't find anything wrong, which I gather isn't that unusual in these situations.

Then again, your problem might be something else entirely....

g.
Oliver Fromme
2006-Sep-13 02:12 UTC
Anyone??? (was Reproducible data corruption on 6.1-Stable)
Jonathan Stewart <jonathan@kc8onw.net> wrote:
> I set up a new server recently and transferred all the information from
> my old server over. I tried to use unison to synchronize the backup of
> pictures I have taken and noticed that a large number of pictures where
> marked as changed on the server. After checking the pictures by hand I
> confirmed that many of the pictures on the server were corrupted. I
> attempted to use unison to update the files on the server with the
> correct local copies but it would fail on almost all the files with the
> message "destination updated during synchronization."
>
> It appears the corruption happens during the read process because when I
> recompare the files in a graphical diff tool between cache flushes the
> differences move around!?!?!? The differences also appear to be very
> small for the most part, single bytes scattered throughout the file. I
> really have no idea what is causing the problem and would like to pin it
> down so I can either replace hardware if it's bad or fix whatever the
> bug is.

That very much sounds like bad RAM, or overclocked CPU or bus. I assume you do not overclock, so I recommend you replace your RAM modules and check if the symptoms are gone.

Also check your BIOS settings for the RAM timings. Setting the timings to more conservative values might already solve the problem.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author and may not necessarily reflect the opinions of secnetix in any way.

"Clear perl code is better than unclear awk code; but NOTHING comes close to unclear perl code" (taken from comp.lang.awk FAQ)
Daniel Gerzo
2006-Sep-14 13:53 UTC
Anyone??? (was Reproducible data corruption on 6.1-Stable)
Hello Jonathan,

Wednesday, September 13, 2006, 2:38:14 AM, you wrote:

> I set up a new server recently and transferred all the information from
> my old server over. I tried to use unison to synchronize the backup of
> pictures I have taken and noticed that a large number of pictures where
> marked as changed on the server. After checking the pictures by hand I
> confirmed that many of the pictures on the server were corrupted.

> It appears the corruption happens during the read process because when I
> recompare the files in a graphical diff tool between cache flushes the
> differences move around!?!?!? The differences also appear to be very
> small for the most part, single bytes scattered throughout the file. I
> really have no idea what is causing the problem and would like to pin it
> down so I can either replace hardware if it's bad or fix whatever the
> bug is.

> CPU: AMD Athlon(tm) XP 3200+ (2090.16-MHz 686-class CPU)
> Origin = "AuthenticAMD" Id = 0x6a0 Stepping = 0

I saw very similar symptoms on a P4 3.2 GHz. I was able to build world without any problems and the overall stability of the machine was completely good, but when I tried to install some ports, the md5 sums didn't match the source even though I was sure the files were all right.

The following simple test demonstrates the problem I was hitting:

root@[bigbang ~]# sha256 /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
SHA256 (/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz) = b95ddf27bc0ffa379c9aa881ca39e92a7d79e0d08999b4dff6d7d9547ee2a72d
root@[bigbang ~]# sha256 /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
SHA256 (/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz) = 71432841b3965b7ab2d83f0dc7c3049195ea4e9267a8dc2d825a8a0466982930
root@[bigbang ~]# sha256 /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
SHA256 (/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz) = 83e44f5301b3270e821850164c74d275f6721bed5d126480cf518a9fe5ca0d6c
root@[bigbang ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
bd8c2e593e1fa4b01fd98eaf016329bb
root@[bigbang ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
bd8c2e593e1fa4b01fd98eaf016329bb
root@[bigbang ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
b9342bb213393238dd37322d4e2ee3fe
root@[bigbang ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
88efa7977fd3febaa8d260e3d5f21917

The memtest didn't show any problems with the RAM and we were unable to clarify what was really going on. Then we managed to get the machine replaced with completely new hardware and the problem was gone.

Later I was told that it is some kind of known bug in older P4 BIOSes (and was advised to update the BIOS, in which it should have been fixed in the meantime), but we were unable to find any information about the problem.

Fortunately the colo company replaced the hardware with no problems. So far so good, and the box is running flawlessly.

--
Best regards,
Daniel
mailto:danger@FreeBSD.org
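
The repeated-checksum test above can be automated so that it runs over many files or many passes. What follows is a rough sketch in Python rather than the `sha256`/`md5` utilities used in the message; the function names `repeated_digests` and `is_stable` are hypothetical, and the defaults (five runs, SHA-256) are arbitrary assumptions.

```python
import hashlib
import sys
from typing import List


def repeated_digests(path: str, runs: int = 5, algorithm: str = "sha256",
                     chunk_size: int = 1 << 20) -> List[str]:
    """Hash the same file `runs` times and return every hex digest."""
    digests = []
    for _ in range(runs):
        h = hashlib.new(algorithm)
        # Reopen the file on each pass so every digest comes from a fresh read.
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        digests.append(h.hexdigest())
    return digests


def is_stable(digests: List[str]) -> bool:
    """True when every pass produced the same digest."""
    return len(set(digests)) == 1


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical): python rehash.py /path/to/file
    results = repeated_digests(sys.argv[1])
    for d in results:
        print(d)
    print("stable" if is_stable(results) else "MISMATCH between passes")
```

On healthy hardware every pass should print the same digest. Note that repeated reads may be served from the operating system's buffer cache, so identical digests alone do not prove that the disk path is sound; differing digests on an unchanged file, as in the transcript above, do indicate a fault somewhere between the disk and the CPU.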