On Sat, 3 Jun 2006, Brian Tao wrote:
> I had a very stable 6.1-R amd64 server (once I swapped out some
> bad RAM, that is) that needed a couple more hard drives installed.
> There were some problems with the upgrade (device renumbering woes,
> basically... topic of another thread), and it had to be rolled back.
>
> Upon rolling back, the previously-good kernel would no longer
> complete the boot after the device probe. I saw two types of panics
> on the serial console:
>
> | Trying to mount root from ufs:/dev/ad4s1a
> | Lookup of /dev for devfs, error: 20
Error 20 is ENOTDIR, which means something along the requested path exists
but is not a directory. From this output it looks like the root directory
entry is somehow corrupted or being misinterpreted.
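
If you want to double-check what a numeric error code from the console
means, here is a minimal sketch using Python's standard errno table. It
assumes the machine you run it on uses the same numbering as the FreeBSD
kernel for this value, which is the case on FreeBSD itself and on most
Unix-like systems:

```python
import errno
import os

# The number the kernel printed in "Lookup of /dev for devfs, error: 20".
code = 20

# Map the number back to its symbolic name using the host's errno table.
# Assumption: the host's numbering agrees with FreeBSD's for this value.
name = errno.errorcode.get(code, "unknown")

# os.strerror() gives the host's human-readable description of the code.
print(code, name, os.strerror(code))
```
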
> | exec /sbin/init: error 20
> | exec /sbin/oinit: error 20
> | exec /sbin/init.bak: error 20
> | exec /rescue/init: error 20
> | exec /stand/sysinstall: error 20
> | init: not found in path
> | /sbin/init:/sbin/oinit:/sbin/init.bak:/rescue/init:/stand/sysinstall
> | panic: no init
> | Uptime: 8s
> | Cannot dump. No dump device defined.
> | Automatic reboot in 15 seconds - press a key on the console to abort
> | --> Press a key on the console to reboot,
> | --> or switch off the system now.
>
> ... and:
>
> | Trying to mount root from ufs:/dev/ad4s1a
> | pid 47 (sh), uid 0: exited on signal 11
> | TPTE at 0xffff8000040028e0 IS ZERO @ VA 80051c000
> | panic: bad pte
> | Uptime: 8s
This is usually indicative of bad RAM or a faulty processor. Since you
seem to be having disk problems, it may just be due to the disk returning
faulty data. Or there is a bad kernel module in the mix that is randomly
corrupting data.
> The first one is suggesting that /dev does not exist (or is not a
> directory)... I'm thinking this means that devfs is somehow
> unavailable, but I did not think it is even possible to disable devfs
> via the kernel config file these days.
>
> The second one leaves me clueless... I have not been able to find
> any useful information on that panic during boot. Granted, I've only
> seen the "bad pte" panic twice... all other reboot attempts result in
> the first type of problem.
>
> Fortunately, I did happen to keep an old 6.0-RELEASE-p6 kernel
> around (Apr 15 2006 build). That kernel boots fine, using the same
> filesystem as newer kernels on that drive. I am up-to-date with the
> RELENG_6_1 tag. Should I perhaps do a make installkernel installworld
> before rebooting? The installed binaries on the server are from an
> early 6.1-RELEASE (which *was* successfully booted by this server). I
> am running into a few minor but surmountable problems because of the
> older kernel version, but I obviously would like to get my world and
> kernel back in sync ASAP.
My gut feeling is that there is still a disconnect on what the root
filesystem is. That or there is hidden corruption that 6.0 isn't noticing
that 6.1 is. Here's what I'd do next:
1. Capture the boot output from both the working 6.0 kernel and your
broken 6.1 kernel and compare the two. If there are differences or errors
being returned from the ATA controller or disks then those will need to be
addressed.
2. Try a splat-over reinstall of 6.1-R from CD to force everything to
match up. Mount the filesystems but don't mark them to be newfs'd. Install
the GENERIC kernel only.
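
For the comparison in step 1, something along these lines would do once
both boots have been captured from the serial console. This is only a
sketch: the log names boot-6.0.log and boot-6.1.log and the SUSPECT
pattern are my own placeholders, and the two sample lists just reuse lines
from your console output rather than real captures.

```python
import difflib
import re

# Hypothetical pattern for lines worth a second look: ATA channels (ataN),
# ATA disks (adN), and a few common failure words. Adjust to taste.
SUSPECT = re.compile(r"\b(ata\d+|ad\d+)\b|error|timeout|fail", re.IGNORECASE)


def compare_boot_logs(good_lines, bad_lines):
    """Return (diff, suspects) for two boot logs given as lists of lines.

    diff is the unified diff; suspects are the added or removed lines
    that match SUSPECT.
    """
    diff = list(difflib.unified_diff(
        good_lines, bad_lines,
        fromfile="boot-6.0.log", tofile="boot-6.1.log",  # placeholder names
        lineterm=""))
    # The first two diff lines are the ---/+++ file headers; skip them.
    suspects = [line for line in diff[2:]
                if line.startswith(("+", "-")) and SUSPECT.search(line)]
    return diff, suspects


# Tiny illustration only, built from lines quoted earlier in this thread.
good = ["Trying to mount root from ufs:/dev/ad4s1a"]
bad = ["Trying to mount root from ufs:/dev/ad4s1a",
       "Lookup of /dev for devfs, error: 20"]

diff, suspects = compare_boot_logs(good, bad)
for line in suspects:
    print(line)
```

With real captures, read each file with open(...).read().splitlines() and
pass the two lists in; any ata/ad lines that differ between the kernels
are the first thing to look at.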
If you are going to be tracking a branch, please read the instructions at
the end of src/UPDATING on how to perform the build. There is a specific
procedure, and not following it can cause significant issues. While
unlikely, it is possible to irreparably damage the system by not following
the instructions to the letter.
--
Doug White | FreeBSD: The Power to Serve
dwhite@gumbysoft.com | www.FreeBSD.org