'k, looks like I'm going to have to back this out ... just upgraded another server to 6.x, CVSup latest -STABLE, built, installed, rebooted ... up fine ...

Running a single 'rsync' to copy files over from another server, it has crashed twice in a row so far ...

I'm enabling dumpdev right now, and will see if I can get a core dump out of it, but, so far, there is nothing being reported in /var/log/messages to indicate a problem ...

Does anyone know of any problems with the current source tree that I should avoid? And, if so, can someone recommend a "stable date" to CVSup to and try? This server isn't production yet, and I'm not panic'd right now to make it so (basically, I've got a couple of days if I need it) ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email . scrappy@hub.org    MSN . scrappy@hub.org
Yahoo . yscrappy           Skype: hub.org           ICQ . 7615664
On Sat, 24 Jun 2006, Marc G. Fournier wrote:

> 'k, looks like I'm going to have to back this out ... just upgraded another
> server to 6.x, CVSup latest -STABLE, built, installed, rebooted ... up fine
> ...
>
> Running a single 'rsync' to copy files from another server over, it has
> crashed twice in a row so far ...
>
> I'm enabling dumpdev right now, and will see if I can get a core dump out of
> it, but, so far, there is nothing being reported in /var/log/messages to
> indicate a problem ...
>
> Does anyone know of any problems with current source tree that I should
> avoid? And, if so, can someone recommend a "stable date" to CVSup in and
> try? This server isn't production yet, and I'm not panic'd right now to make
> it so (basically, I've got a couple of days if I need it) ...

Just found this in my /var/log/messages file after the last reboot to enable savecore/dumpdev:

Jun 25 00:19:59 jupiter kernel: ACPI-0356: *** Error: Region SystemIO(1) has no handler
Jun 25 00:19:59 jupiter kernel: ACPI-1304: *** Error: Method execution failed [\_SB_.LN02._STA] (Node 0xc9071920), AE_NOT_EXIST
Jun 25 00:19:59 jupiter kernel: ACPI-0239: *** Error: Method execution failed [\_SB_.LN02._STA] (Node 0xc9071920), AE_NOT_EXIST

For those on the -acpi list, this machine is an Intel Dual-PIII motherboard ...

----
Marc G. Fournier
> I am not looking for workarounds, like ECC. I want the box to break
> immediately once any single component goes wrong...

Uh, that *is* what ECC does (or can do). Without ECC your broken hardware continues to run unnoticed. With ECC you can either make it break immediately, or log an error, or continue to run.

Stop thinking of ECC as error correction and start thinking of it as error detection. No ECC gives you no way to detect failing memory.

-pete.
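To make the detection-versus-correction point concrete, here is a minimal Python sketch of a SECDED code (Hamming(7,4) plus one overall parity bit) with the three policies mentioned above: break immediately, log, or continue. It is a toy model only. Real ECC memory uses much wider codewords and the policy is chosen by the chipset, firmware, and OS, not by code like this. The names `encode`, `decode`, `read_word`, `MemoryFault`, and the policy strings are all hypothetical, chosen for illustration.

```python
class MemoryFault(Exception):
    """Raised when the chosen policy says a memory error must stop the system."""


def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1).

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming(7,4) positions, with parity bits at 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)  # work on a copy
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Exactly one bit flipped; the syndrome is its index
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with intact overall parity: two bits flipped.
        status = "uncorrectable"
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data


def read_word(code, policy="panic"):
    """Read a word under one of three policies: 'panic', 'log', 'continue'."""
    status, data = decode(code)
    if status == "uncorrectable":
        raise MemoryFault("uncorrectable error")  # data cannot be trusted
    if status == "corrected":
        if policy == "panic":
            raise MemoryFault("single-bit error detected")
        if policy == "log":
            print("ECC: corrected single-bit error")
    return data
```

With `policy="panic"` a single flipped bit stops the read at once, which is the "break immediately" behaviour asked for in the quoted text. The same detection logic under `"log"` or `"continue"` silently repairs the word instead. Without the check bits there would be nothing to compare against, so the flip would go unnoticed.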
So what do I need to do to make the box panic() on an ECC error? Is there a kernel parameter, a sysctl, or something else?

Thanks,

M.

Pete French wrote:

>> I am not looking for workarounds, like ECC. I want the box to break
>> immediately once any single component goes wrong...
>
> Uh, that *is* what ECC does (or can do). Without ECC your broken hardware
> continues to run un-noticed. With ECC you can either make it break
> immediately, or log an error or continue to run.
>
> Stop thinking of ECC as error correction and start thinking of it as
> error detection. No ECC gives you no way to detect failing memory.
>
> -pete.
Just wanted to say thank you for clearing up my confusion about ECC.

Also, I want to apologize for being a bit harsh in some posts. (I am a rather cynical person; it keeps me from going crazy over all this stuff.) Last night, after hours of working on the very same problem without any success at all, I was at the end of my rope. Sorry, I'll try to refrain from posting in such situations in the future.

So it seems I cannot track down RAM problems in software. Thanks very much; besides my lack of understanding of ECC, I wasn't aware of that either. Lesson learned.

Since everything worked fine before, I guess something must have broken when I took the machine off the shelf. But I have now decided to take the easy way out and retire the hardware. This old box isn't worth wasting more time on...

M.