On Saturday I finally upgraded a machine from CentOS 4.3 (I think) to 4.5 via yum. It seemed to go fine. However, during the following night /home got remounted read-only because of an EXT3-fs error. The same happened the next night. Also, today, I saw the first-ever kernel crash on this machine.

The machine is about three years old, went into production two years ago with CentOS 4.1 or so and has been rock stable since then. No fs errors, no kernel crashes, no other "weird" occurrences. As the problems started right after upgrading to a new kernel, I suspect a bug in the kernel (or some module) rather than a hardware problem. No RAID, no LVM, a few partitions on an IDE disk.

I didn't file it as a bug yet. I want to gather some more information or get some help first. Here are some details.

The kernel was updated from 2.6.9-34.0.2.EL to 2.6.9-55.0.12.EL. There is not a single package update missing now.

Dec 9 04:30:35 nx10 kernel: EXT3-fs error (device hda3): htree_dirblock_to_tree: bad entry in directory #1330023: rec_len % 4 != 0 - offset=10264, inode=808542775, rec_len=13621, name_len=100
Dec 9 04:30:35 nx10 kernel: Aborting journal on device hda3.
Dec 9 04:30:35 nx10 kernel: ext3_abort called.
Dec 9 04:30:35 nx10 kernel: EXT3-fs error (device hda3): ext3_journal_start_sb: Detected aborted journal
Dec 9 04:30:35 nx10 kernel: Remounting filesystem read-only

The second error, tonight, happened about 5 minutes earlier, with exactly the same directory inode.

http://www.google.de/search?as_q=centos+rec_len+4+0&hl=de&num=30&btnG=Google-Suche&as_epq=bad+entry+in+directory&as_oq=&as_eq=&lr=&cr=&as_ft=i&as_filetype=&as_qdr=all&as_occt=any&as_dt=i&as_sitesearch=&as_rights=&safe=images
shows that this error is very rare (I also tried the search with "fedora" and got a few more hits). It seems to be related to heavy disk I/O, but only under certain (hardware?) circumstances, and may be a bug that was introduced in some Fedora kernel and crept into RHEL/CentOS 4.4/4.5.
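For reference, the condition named in the first log line ("rec_len % 4 != 0") can be checked by hand against the reported values. The following is only a minimal sketch: `parse_fields` is a hypothetical helper written for this illustration, not part of any ext3 tool, and it does nothing beyond pulling the key=value numbers out of the message.

```python
import re

# Log message copied from the report above.
LOG = ("EXT3-fs error (device hda3): htree_dirblock_to_tree: bad entry in "
       "directory #1330023: rec_len % 4 != 0 - offset=10264, "
       "inode=808542775, rec_len=13621, name_len=100")


def parse_fields(line):
    """Hypothetical helper: collect the key=value integers from the message."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


fields = parse_fields(LOG)
remainder = fields["rec_len"] % 4

print(fields)
print("rec_len % 4 =", remainder)
```

A non-zero remainder is exactly what the message complains about: the record length read from the directory block is not a multiple of 4. That alone does not say whether the bad value came from the disk, from memory, or from the kernel.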
Once this happens, that filesystem (in my case /home) is read-only and the machine just hangs when one tries to shut down (probably when unmounting) or to remount it read-only (for a filesystem check). After a hard reset the automatic fsck lists only a few orphan inode cleanups in dmesg. Also, I found that dmesg gives me output from the iptables logging (which is on kern.=debug) before the problem is fixed with a reset.

Can I use debugfs safely on that system while the filesystem is mounted? I'm not familiar with it and just stumbled over a mention of it. I tried it on a machine here on a mounted device and there was no problem. The affected machine, however, is in a remote data center, so my options are a bit limited.

The kernel crash from today starts like this:

Dec 10 10:30:01 nx10 kernel: Unable to handle kernel paging request at virtual address 8f38df23
Dec 10 10:30:01 nx10 kernel: printing eip:
Dec 10 10:30:01 nx10 kernel: c019190b
Dec 10 10:30:01 nx10 kernel: *pde = 00000000
Dec 10 10:30:01 nx10 kernel: Oops: 0000 [#1]
Dec 10 10:30:01 nx10 kernel: Modules linked in: ipt_REJECT ipt_limit ipt_state ipt_LOG iptable_filter ip_tables ip_conntrack_ftp ip_conntrack md5 ipv6 autofs4 i2c_dev i2c_core sunrpc dm_mirror dm_mod button battery ac 8139too mii ext3 jbd ata_piix libata sd_mod scsi_mod
Dec 10 10:30:01 nx10 kernel: CPU: 0
Dec 10 10:30:01 nx10 kernel: EIP: 0060:[<c019190b>] Not tainted VLI
Dec 10 10:30:01 nx10 kernel: EFLAGS: 00010282 (2.6.9-55.0.12.EL)
Dec 10 10:30:01 nx10 kernel: EIP is at seq_escape+0x21/0xaa
Dec 10 10:30:01 nx10 kernel: eax: 8f38df23 ebx: c0370260 ecx: d35a9151 edx: d35aa000
Dec 10 10:30:01 nx10 kernel: esi: c518c200 edi: c518c200 ebp: c032f9d9 esp: c63d9f28
Dec 10 10:30:01 nx10 kernel: ds: 007b es: 007b ss: 0068
Dec 10 10:30:01 nx10 kernel: Process mv (pid: 16585, threadinfo=c63d9000 task=cee5a1b0)
Dec 10 10:30:01 nx10 kernel: Stack: d35aa000 8f38df23 c0370260 c518c200 dfe08982 00000000 c018e0e3 c03702c0
Dec 10 10:30:01 nx10 kernel: c518c200 00000000 dfe08982 c019157f 00000151 00000000 00000400 b7fd5000
Dec 10 10:30:01 nx10 kernel: 0000000c 00000000 0000000b 00000000 c0371300 cea00b80 00000400 c63d9fac
Dec 10 10:30:01 nx10 kernel: Call Trace:
Dec 10 10:30:01 nx10 kernel: [<c018e0e3>] show_vfsmnt+0x28/0xf5
Dec 10 10:30:01 nx10 kernel: [<c019157f>] seq_read+0x1c3/0x2bd
Dec 10 10:30:01 nx10 kernel: [<c016c91b>] vfs_read+0xb6/0xe2
Dec 10 10:30:01 nx10 kernel: [<c016cb30>] sys_read+0x3c/0x62
Dec 10 10:30:01 nx10 kernel: [<c031b777>] syscall_call+0x7/0xb

I wonder if I can go back to 2.6.9-34.0.2.EL. Should I expect problems with the other updated packages?

Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
Alfred von Campe
2007-Dec-10 12:40 UTC
[CentOS] unstable kernel after update to CentOS 4.5
Kai:

> Dec 9 04:30:35 nx10 kernel: EXT3-fs error (device hda3): htree_dirblock_to_tree: bad entry in directory #1330023: rec_len % 4 != 0 - offset=10264, inode=808542775, rec_len=13621, name_len=100
> Dec 9 04:30:35 nx10 kernel: Aborting journal on device hda3.
> Dec 9 04:30:35 nx10 kernel: ext3_abort called.
> Dec 9 04:30:35 nx10 kernel: EXT3-fs error (device hda3): ext3_journal_start_sb: Detected aborted journal
> Dec 9 04:30:35 nx10 kernel: Remounting filesystem read-only

Updating to 4.5 was just a coincidence. I believe you have a disk that's going bad. I've seen this error three times and it has always been a bad disk. Back up what you can and replace the disk.

Alfred
Kai Schaetzl wrote on Mon, 10 Dec 2007 13:23:26 +0100:

> Dec 9 04:30:35 nx10 kernel: EXT3-fs error (device hda3): htree_dirblock_to_tree: bad entry in directory #1330023: rec_len % 4 != 0 - offset=10264, inode=808542775, rec_len=13621, name_len=100

I checked the filesystem in the evening and it's clean. I really doubt there's anything wrong with the disk. What makes me wonder is that high inode number. According to df -i that partition has 3145728 inodes in total, and a debugfs stat on inode 808542775 tells me that it doesn't exist. I don't know how I could access directory #1330023, but I assume it doesn't exist either, just like inode 808542775.

Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
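The observation about the inode number can be taken one step further with a little arithmetic. The sketch below compares the reported inode with the total from df -i and then packs the three reported fields back into bytes. It assumes the usual ext3 directory entry header layout (little-endian 32-bit inode, 16-bit rec_len, 8-bit name_len); that layout is an assumption added here, not something stated in the thread, and the file-type byte that would follow is not in the log.

```python
import struct

INODE_COUNT = 3145728  # total inodes on the partition, from `df -i` as quoted above

# Values taken from the EXT3-fs error message.
inode, rec_len, name_len = 808542775, 13621, 100

# Assumed on-disk header layout: <u32 inode, u16 rec_len, u8 name_len>.
raw = struct.pack("<IHB", inode, rec_len, name_len)
in_range = 1 <= inode <= INODE_COUNT

print("inode within range:", in_range)
print("raw bytes (hex):", raw.hex())
print("raw bytes (ASCII):", raw.decode("ascii"))
```

Under that assumption the seven header bytes are all printable ASCII characters, which would fit a block holding ordinary text where the kernel expected directory entries. This is only one possible reading, and it does not by itself distinguish between a failing disk, bad memory, or a kernel bug.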