Hello All.

I have five CentOS 4.4 x86_64 (AMD FX-62) boxes with 4 GB of Corsair dual-channel RAM running kernel 2.6.9-42.0.10.EL (smp).

The sixth box is the same except for the video card and RAM. It has 8 GB of GSkill dual-channel RAM and a very cheap ATI video card. This box segfaults on boot unless I boot with noapic, acpi=off and nolapic. Even with these boot parameters, the box is not 100% stable. I have seen udev segfault right at boot, and copying 15 MB files from /tmp to /usr sometimes locks the machine up. Yum and Perl have also been corrupted at different times. I have 5 fans in the box to vent heat, and the BIOS reports the CPU at around 90 degrees Fahrenheit when rebooting.

I have run memtest86 (from Knoppix 5.1) all night on the 8 GB of RAM and it comes up OK. In the past, we have found problematic Corsair RAM this way. As far as memtest can tell, this 8 GB is fine.

Today I will begin removing RAM and testing to see what happens. Does anyone have ideas about what the problem might be? The other five boxes with Corsair RAM all work fine.

Thanks
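The copy-related lockups and the corrupted Yum and Perl files suggest a simple repeated copy-and-compare check. The following is a minimal sketch under stated assumptions: the function name copy_verify, the scratch file name and the example paths are hypothetical and do not come from the original post. Because the comparison is likely to be served from the page cache, a mismatch points more towards memory or chipset trouble than towards the disk.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC into DEST_DIR repeatedly and compare each
# copy byte-for-byte against the source to catch silent corruption.
# Usage: copy_verify SRC DEST_DIR [PASSES]
copy_verify() {
    src=$1
    dest_dir=$2
    passes=${3:-10}          # default of 10 passes is an arbitrary choice
    i=1
    while [ "$i" -le "$passes" ]; do
        cp "$src" "$dest_dir/copytest.$$" || return 2
        if ! cmp -s "$src" "$dest_dir/copytest.$$"; then
            echo "pass $i: MISMATCH"
            return 1         # leave the bad copy in place for inspection
        fi
        rm -f "$dest_dir/copytest.$$"
        i=$((i + 1))
    done
    echo "all $passes passes matched"
}
```

A possible invocation, with the file size and paths chosen only to mirror the symptom described above: `dd if=/dev/urandom of=/tmp/test15m bs=1M count=15` followed by `copy_verify /tmp/test15m /usr/local 100`. Separately, `rpm -V yum perl` reports installed package files whose checksums no longer match the RPM database.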
On the unstable machine: what motherboard, CPU, NIC, video card, hard disk, etc.?

--
My "Foundation" verse: Isa 54:17 No weapon that is formed against thee shall prosper; and every tongue that shall rise against thee in judgment thou shalt condemn. This is the heritage of the servants of the LORD, and their righteousness is of me, saith the LORD.
-- carpe ductum -- "Grab the tape"
CDTT (Certified Duct Tape Technician)
Linux user #322099 Machines: 206822 256638 276825 http://counter.li.org/
> I have run memtest86 (from Knoppix 5.1) all night long on the
> 8 gig of ram and it comes up OK. In the past, we have found
> problematic Corsair ram this way. This 8 gig seems to be
> OK as far as memtest thinks.
>
> Today I will begin removing ram and testing to see what
> happens. Does anyone have some ideas what might be the
> problem? All the other five boxes with Corsair ram work
> fine.

Is this the only box with 8 GB of RAM? Do the others have smaller amounts? Are the motherboard and BIOS the same? Are you running the same kernel?

Is the machine stable if you run it with mem=1024M on the kernel options line (limiting it to 1 GB of RAM)?

Do you need the hugemem kernel?

--
rgds
Stephen
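After rebooting with mem=1024M, it can be worth confirming that the limit actually took effect before drawing conclusions about stability. The following is a minimal sketch: the function names mem_total_mb and has_mem_limit are made up for illustration, and both default to the standard /proc files but accept another path so they can be tried on saved copies. Expect the reported total to be somewhat below 1024 MB, since the kernel reserves some memory for itself.

```shell
#!/bin/sh
# Hypothetical helper: print MemTotal in whole MB from a meminfo-format file.
# Usage: mem_total_mb [FILE]   (FILE defaults to /proc/meminfo)
mem_total_mb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: succeed if the kernel command line carries a mem= option.
# Usage: has_mem_limit [FILE]  (FILE defaults to /proc/cmdline)
has_mem_limit() {
    grep -Eq '(^| )mem=[0-9]+[KMGkmg]?( |$)' "${1:-/proc/cmdline}"
}
```

For example, `has_mem_limit && echo "limited to $(mem_total_mb) MB"` prints a line only when a mem= option is present on the running kernel's command line.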
On Thursday 22 March 2007, tblader wrote:
> Hello All.
> I have five Centos 4.4 x86_64 (amd fx-62) boxes with
> 4 gig of Corsair dual channel ram running
> kernel 2.6.9-42.0.10.EL (smp).
>
> The sixth box is the same except for video card and ram.

Have you tried switching memory modules between two of your machines? It would be interesting to see whether the problem follows the memory or stays with the box.

> It is using 8gig of GSkill dual channel ram and has
> a very cheap ATI video card in it.

Do you use fglrx (ATI's kernel module)? If so, does it die even if you don't load it?

/Peter
Peter Kjellstrom wrote:
> On Thursday 22 March 2007, tblader wrote:
>> Hello All.
>> I have five Centos 4.4 x86_64 (amd fx-62) boxes with
>> 4 gig of Corsair dual channel ram running
>> kernel 2.6.9-42.0.10.EL (smp).
>>
>> The sixth box is the same except for video card and ram.
>
> Have you tried to switch memory modules between two of your machines? ..would
> be interesting to see if the problem follows that memory or stays with the
> box.
>
>> It is using 8gig of GSkill dual channel ram and has
>> a very cheap ATI video card in it.
>
> Do you use fglrx (ati's kernel module). If so, does it die even if you don't
> load it?

I have not swapped in the Corsair memory modules yet. However, I have pulled one stick of the GSkill RAM, leaving 6 GB. I will see if I can reproduce the lockup with 6 GB and keep removing RAM until something changes. I can swap the GSkill for the Corsair over the weekend, as the other boxes are in use during the day.

I have left the X configuration at its defaults. xorg.conf shows the vesa driver and the following modules (probably only used when vncserver is running on localhost?):

...
Section "Module"
    Load "dbe"
    Load "extmod"
    Load "fbdevhw"
    Load "glx"
    Load "record"
    Load "freetype"
    Load "type1"
    Load "dri"
EndSection
...
Section "Device"
    Identifier "Videocard0"
    Driver "vesa"
    VendorName "Videocard vendor"
    BoardName "VESA driver (generic)"
EndSection
...

--
Flambeau Inc. Technology Center - Baraboo, WI
Email : tblader at flambeau.com
Keyserver: http://pgp.mit.edu
KeyID: 0x00E9EC2C
Hi All,

Just a follow-up. Pulling out one stick (2 GB) of the 8 GB seems to fix the problem.

I ran into the most problems while installing Sun's Java (jre-1_5_0_11-linux-i586.bin). Things consistently lock up while it is unzipping into /usr/local, so this is what I have been using as a test. There is sufficient disk space in the partition. I bypassed the prompt during the install and made a loop that removes the jre1.5.0_11 directory and then runs sh -x on jre-1_5_0_11-linux-i586.bin. Left running, this locks the machine within a couple of loops. With one stick of RAM removed, the loop ran for at least 15 minutes before I terminated it.

I have moved the RAM around between slots to see if it made a difference, and have also removed different single sticks to test. It seems the only thing that improves stability is removing a single stick from either the channel A or the channel B slots.

On the last test with all 4 sticks in, this message was printed to my ssh session right before the machine went down:

oas kernel: Assertion failure in __journal_temp_unlink_buffer() at fs/jbd/transaction.c:1521: "jbd_is_locked_bh_state(bh)"

Hopefully this all makes sense to someone more knowledgeable than I.

Thomas
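The remove-and-reinstall loop described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the script actually used: the function name stress_install and the loop count are hypothetical, the installer is run with plain sh rather than sh -x to keep the output short, and how the license prompt was bypassed is not shown because the message does not say.

```shell
#!/bin/sh
# Hypothetical helper: repeatedly delete TARGET_DIR and rerun a self-extracting
# installer from the current directory, logging each pass with a timestamp.
# Usage: stress_install INSTALLER TARGET_DIR [LOOPS]
stress_install() {
    installer=$1
    target=$2
    loops=${3:-5}            # default of 5 loops is an arbitrary choice
    n=1
    while [ "$n" -le "$loops" ]; do
        rm -rf "$target"
        echo "loop $n: $(date)"
        if ! sh "$installer" > /dev/null; then
            echo "installer failed on loop $n"
            return 1
        fi
        n=$((n + 1))
    done
    echo "completed $loops loops"
}
```

Run from /usr/local, a call might look like `stress_install ./jre-1_5_0_11-linux-i586.bin jre1.5.0_11 50`. Logging to a remote ssh session, as in the message above, helps show which loop was the last to start before a lockup.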