Here I don't want to start a discussion, but rather share a *solution* that took me some days to come up with.

The whole thing started with a power outage. After reboot, my server (ext3) came back with the dreaded file system error (Ctrl-D to reboot or password for maintenance) - if you've never seen this, consider yourself lucky! In any case, I ran fsck as prescribed, but abandoned the effort after having to type 'Y' more than 100 times, and restarted fsck with '-y'. *Never* ever do this on ext3, as you will see later! It started but then informed me about " ... too many errors" or so. Next, after reboot, it came up with a kernel panic: No init found.

This is almost the time for a re-install, isn't it!? No, I tried the repair option first. Bad luck: while reading my nice root partition (hda6), it complained "Error mounting filesystem on hda6: Invalid argument", and "You don't have any Linux partitions. Press return ..."; though I could *see* all files on hda6 nicely at the shell. I checked fstab: okay. I could even mount /dev/hda6 on /mnt/help; looking pretty sane. Though I was losing my sanity ... !

Finally, enlightenment struck, and here is the problem and the solution: The "fsck -y" (see above) had not been able to handle all the errors and made the journal unusable. This is why all rescue and booting ended in disarray: the corrupted journal made the partition look invalid as ext3, though it was not so bad. I only had to convert it to ext2, have fsck repair all the errors and finally recreate the journal, effectively reconverting it to ext3. It has been running ever since without problems.

I am even inclined to consider that behaviour a bug, since a somewhat minor problem made things worse (kernel panic!) unnecessarily. It seems the journal got corrupted not by the outage but simply by too many automated 'Y' answers *during repair* !? At least, it had been okay for the first boot after the outage. Remarkable!

Uwe
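The three steps Uwe describes (drop the journal, repair as ext2, recreate the journal) might look roughly like this when done from a rescue system with the partition unmounted. This is only a sketch under assumptions: the device name /dev/hda6 is taken from the post, the function name ext3_recovery_plan is made up for illustration, and the function only prints the commands so they can be reviewed and then run by hand. The post does not say which exact commands were used.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the three-step recovery
# described in the post for the given device. The printed commands are
# meant to be run by hand from a rescue system, partition unmounted.
ext3_recovery_plan() {
    dev="$1"
    # 1. Remove the journal, turning the filesystem back into ext2.
    echo "tune2fs -O ^has_journal $dev"
    # 2. Force a full check and repair of the now-ext2 filesystem.
    echo "e2fsck -f $dev"
    # 3. Recreate the journal, making the filesystem ext3 again.
    echo "tune2fs -j $dev"
}

ext3_recovery_plan /dev/hda6
```

Printing instead of executing is deliberate: each step changes the filesystem, and tune2fs may refuse to drop a journal that still needs recovery, so it seems safer to look at the plan first and react to each command's output.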
On Mon, May 20, 2002 at 04:57:19PM +0800, Uwe Dippel wrote:
> Here I don't want to start a discussion, but rather share a *solution*
> that took me some days to come up with.
> The whole stuff started with a power-outage. After reboot, my server
> (ext3) came back with the dreaded file system error (Ctrl-D to reboot or
> password for maintenance) - if you've never seen this, consider yourself
> lucky!
> In any case, I did the fsck as prescribed, but abandoned the effort
> after having to type the 'Y' for more than 100 times; restarting fsck
> with '-y'. *Never* ever do this on ext3, as you will see later!

What sort of errors was fsck finding?

> It started but then informed me about " ... too many errors" or so.
> Next, after reboot, it came with a kernel-panic: No init found.

So the root fs had failed to load...

> This is almost the time for re-install, isn't it!? No, I tried the
> repair before. Bad luck, while reading my nice root-partition (hda6), it
> complained "Error mounting filesystem on hda6: Invalid argument"

Yep. Did the kernel give any error log as to why it failed to mount?

> I only had to convert it to ext2, have it
> repair all the errors and finally recreate the journal; effectively
> reconvert it to ext3. It is been running ever since without problem.
> I am even pondering to consider that behaviour a bug

If fsck left the fs unmountable, then yes, it is a bug. The question is, what damage did it fail to repair? We'd need a lot more detail in your bug report to be sure of that.

Cheers,
Stephen
On Mon, May 20, 2002 at 04:57:19PM +0800, Uwe Dippel wrote:
> In any case, I did the fsck as prescribed, but abandoned the effort
> after having to type the 'Y' for more than 100 times; restarting fsck
> with '-y'. *Never* ever do this on ext3, as you will see later!
> It started but then informed me about " ... too many errors" or so.

There is no "too many errors" message printed by e2fsck. The closest is "too many illegal blocks" in an inode, which is followed by an offer to clear the inode entirely. If an inode has a corrupted indirect block, this can cause e2fsck to offer to give up on the inode altogether. This is normal behaviour, and shouldn't have caused any problem (unless, of course, the system really needed the corrupted inode for its boot scripts).

> Next, after reboot, it came with a kernel-panic: No init found.
> This is almost the time for re-install, isn't it!? No, I tried the
> repair before. Bad luck, while reading my nice root-partition (hda6), it
> complained "Error mounting filesystem on hda6: Invalid argument", and
> "You don't have any Linux partitions. Press return ..."; though I could
> *see* all files on hda6 nicely at the shell. I checked fstab: okay. I
> could even mount /dev/hda6 to /mnt/help; looking pretty sane. Though I
> was loosing out on my sanity ... !

What distribution are you using (e.g., RedHat, Mandrake, Debian, etc.), and do you know how the boot scripts were trying to mount the root partition? Based on what you reported, it sounds like the boot scripts were only trying to mount it as ext3, and not as anything else. Normally the initrd scripts are set up to load the ext3 module, if necessary, and then attempt a mount of the filesystem without specifying the filesystem type. That way, the kernel will automatically attempt to mount the filesystem first as ext3, and then if that fails, it will fall back and attempt a mount as ext2.

> The "fsck -y" (see above) had not been able to handle all the errors and
> made the journal unusable. This is why all rescue and booting ended in
> disarray: the corrupted journal made the partition look invalid as ext3,
> though it was not so bad. I only had to convert it to ext2, have it
> repair all the errors and finally recreate the journal; effectively
> reconvert it to ext3. It is been running ever since without problem.
> I am even pondering to consider that behaviour a bug, since a somewhat
> minor problem made things worse (kernel panic!) unnecessarily. It seems
> the journal got corrupted not by the outage but by simply too many
> automated 'Y' *during repair* !? At least, it had been okay for the
> first boot after the outage. Remarkable!

There are some filesystem corruptions that will cause e2fsck to offer to delete the journal, after which it will explicitly state that the journal inode has been removed, and that the filesystem has been reconverted to ext2. It doesn't take a lot of filesystem errors, and this will happen with or without "fsck -y" (that's just a red herring). Certain specific filesystem corruptions simply corrupt the journal file, and cause e2fsck to need to remove it.

It's possible that there is some subtle filesystem corruption where e2fsck doesn't clear out the journal inode and reconvert the filesystem to ext2, but you haven't given us enough information to know exactly what the filesystem corruption was. (See the man page for e2fsck, in the REPORTING BUGS section, for a discussion of the sort of information that is really needed for me to be able to reproduce, find, and fix e2fsck bugs.)

As I said, "fsck -y" is a red herring. All -y does is the equivalent of your typing 'y' to every question asked by e2fsck; it just saves a little keyboard wear and tear. The real question is how your filesystem was corrupted, and how e2fsck reacted to that particular form of filesystem corruption. E2fsck *should* be able to handle just about any form of filesystem corruption, for ext2 or ext3, and it surely shouldn't make things worse. But of course, no software is bug-free, and you may have found some new and innovative way for your filesystems to be corrupted, which e2fsck doesn't deal with correctly. But I would need a lot more information to be able to figure out exactly what happened.

- Ted
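Whether e2fsck really did drop the journal can be checked from a rescue shell by looking at the feature list that "dumpe2fs -h" prints for the device. The following is a minimal sketch, not something taken from the thread: the helper name journal_state is made up, and it assumes the "Filesystem features:" line that dumpe2fs prints in its superblock summary.

```shell
#!/bin/sh
# Hypothetical helper: reads "dumpe2fs -h" style output on stdin and
# reports whether the feature list contains has_journal (ext3) or
# not (plain ext2).
journal_state() {
    if grep '^Filesystem features:' | grep -qw 'has_journal'; then
        echo "ext3 (journal present)"
    else
        echo "ext2 (no journal)"
    fi
}

# Intended use from a rescue shell (device name taken from the thread):
#   dumpe2fs -h /dev/hda6 2>/dev/null | journal_state
```

Reading from stdin keeps the helper independent of the device, so the same check can be run against saved dumpe2fs output that is attached to a bug report.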
On Thu, May 23, 2002 at 10:08:03AM +0800, Uwe Dippel wrote:
> Please accept, that I wrote the stuff as closely as possible. I had even
> taken a few notes. If you are not a filesystem hacker, you'll probably
> just try to repair the whole stuff and this is what I did. You are
> probably right on the "too many illegal blocks". In any case, I answered
> all with '-y' and found the situation as described after reboot:
> "Mounting /proc filesystem
> Creating root device
> Mounting root filesystem
> ext3: No journal on filesystem on ide0(3,6)
> mount: error 22 mounting ext3
> pivotroot: pivot_root(/sysroot, /sysroot/initrd) failed:
> Freeing unused kernel memeory: 220k free
> Kernel panic: No init found"
> (Please don't bash me for mistakes copying this manually twice!)

OK, given what you've told me, I suspect you can reproduce the failure simply by converting your root partition back to ext2 (boot a rescue system, and then run the command "tune2fs -O ^has_journal" to remove the journal), but leaving the fstab entry as ext3. I don't have a recent Red Hat system hanging around, so I can't test it myself, but it seems pretty clear from what you've described. (I probably should have a RH machine around, but it's a pain to keep up with security package updates for two different distributions, and all of my machines are running Debian these days. I should have one, though, since it's handy for debugging these sorts of boot-script problems, which tend to be distribution specific.)

Anyway, what happens is that the fstab entry says the filesystem is ext3, but the filesystem no longer has a journal, so the ext3 mount fails. Since the ext3 mount fails, the root filesystem is never mounted, which makes the pivot_root operation fail, and that's why you're getting the "No init found" error message.
One way of making things more robust is to use an /etc/fstab entry which looks like this:

    /dev/hda1    /    ext3,ext2    defaults    1 1

This will cause mount and fsck to first try ext3, and then ext2, so that in the case where the filesystem is converted back to ext2 due to a filesystem error, the system will actually come back up cleanly. The only gotcha here is that the code to support having a list of filesystems in /etc/fstab is relatively new, and I don't know what version of Red Hat you're running; if it's too old, it might not have a new enough version of mount and fsck (in the util-linux and e2fsprogs packages, respectively).

Another way of solving the problem would be to add something like the following to Red Hat's initrd linuxrc, before it tries to mount the real root partition:

    # Device of the root entry in fstab, if it is declared as ext3
    ext3root=`awk '{ if (($2 == "/") && ($3 == "ext3")) {print $1;}}' /etc/fstab`
    if test -n "$ext3root" ; then
        # Add a journal if the filesystem does not have one
        tune2fs -j "$ext3root" > /dev/null
    fi

The tune2fs command is a no-op if the filesystem already has a journal, but if it doesn't, it will automatically add a journal to the filesystem. So this is nice, both because it allows Red Hat to recover from a missing journal automatically, and because it means it becomes very simple to convert your root filesystem to ext3. All you need to do is to edit /etc/fstab so that root's fs type is ext3, and then reboot the system.

> And here my enhancement proposal drops in: The data were quite well,
> only mounting as ext3 failed. And this was non-trivial to find out,
> since - as described - the 'Repair' option of the Install-CDROMs also
> complained about this "...Invalid Argument ..... You don't have any
> Linux partitions". Re-install is option of choice, you might think; but:
> no, deleting the journal and a normal fsck on the resulting ext2 brought
> everything back to normal

Yes, you're right. The system ought to be more robust about the discrepancy between what's in /etc/fstab and whether the filesystem is ext3 or ext2. However, this is a distribution-specific problem, not one which I can fix in e2fsprogs. So what I'd suggest you do is to file an enhancement request with Red Hat. (Or Stephen, who works for Red Hat, may be able to file this enhancement request for you.) I will file a similar enhancement request for Debian.

- Ted
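The discrepancy Ted describes (fstab says ext3, the filesystem has lost its journal) could also be detected before it causes a failed boot. The following is a rough sketch of such a check, not part of any distribution's boot scripts: the function name check_root_fstype is made up, and the feature list passed as the second argument is assumed to come from the "Filesystem features:" line of "dumpe2fs -h".

```shell
#!/bin/sh
# Hypothetical check: warn when fstab declares the root filesystem as
# plain "ext3" while the filesystem itself has no journal.
#   $1 = path to an fstab file
#   $2 = feature list, e.g. taken from "dumpe2fs -h" output
check_root_fstype() {
    fstab="$1"
    features="$2"
    # Third field of the entry whose mount point is "/", skipping comments.
    fstype=$(awk '$1 !~ /^#/ && $2 == "/" { print $3; exit }' "$fstab")
    # Look for has_journal as a whole word in the feature list.
    case " $features " in
        *" has_journal "*) journal=yes ;;
        *)                 journal=no ;;
    esac
    if [ "$fstype" = "ext3" ] && [ "$journal" = "no" ]; then
        echo "mismatch: fstab says ext3 but the filesystem has no journal"
        return 1
    fi
    echo "ok: fstab type '$fstype', journal=$journal"
}
```

With an fstab type of "ext3,ext2" the check reports "ok" even without a journal, which matches the intent of the fallback entry: the ext2 mount is then expected to succeed.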
UWE HEINZ RUDI DIPPEL
2002-May-25 11:55 UTC
RE: A solution to Kernel Panic ... on ext3 only!
Dear all,

thanks for your comments. Now I really feel I ought to have documented what I did better, and once again ask for your forgiveness on this!

*If* and only *if* I remember well, Theodore, contrary to what you wrote in your first mail, fsck had not degraded the ext3 to ext2 gracefully and there was a journal left (for deletion), which I then deleted.

Secondly, and here I ask Stephen before messing up the machine again: if it had been a clean, journal-less ext2, the repair option should not have said "You don't have any Linux partitions", should it? There must have been something very wrong with hda6 for it to come up with this message. It might be helpful here to know which conditions make the repair option issue this statement. This had even made me run fdisk to see if the partitions had been destroyed, which returned a clear '83', finally motivating me to continue. fdisk was of the opinion '83' and 'repair' of the opinion '!=83' ?? Strange!

Even if we don't manage to reconstruct the situation: Thanks again for all your efforts!

Uwe