Hi all,

I got this on a crash on an OSS (Lustre 1.6.3):

[root@oss01 ~]# cat /proc/fs/lustre/health_check
device lustre-OST0012 reported unhealthy
device lustre-OST0014 reported unhealthy
device lustre-OST0016 reported unhealthy
NOT HEALTHY

In /var/log/messages we have:

Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-15): read_block_bitmap: Invalid block bitmap - block_group = 10648, block = 348913664
Dec 17 14:40:56 oss01 kernel: Remounting filesystem read-only
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695936(bit 19200 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: Remounting filesystem read-only
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695937(bit 19201 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695938(bit 19202 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695939(bit 19203 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695940(bit 19204 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695941(bit 19205 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695942(bit 19206 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695943(bit 19207 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695944(bit 19208 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695945(bit 19209 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695946(bit 19210 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695947(bit 19211 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695948(bit 19212 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695949(bit 19213 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695950(bit 19214 in group 28402)
Dec 17 14:40:56 oss01 kernel:
Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of inode 232644664's block 930695951(bit 19215 in group 28402)
(....)
Dec 17 14:41:17 oss01 kernel:
Dec 17 14:41:17 oss01 kernel: LDISKFS-fs error (device dm-16): mb_free_blocks: double-free of inode 214925368's block 859725308(bit 24060 in group 26236)
Dec 17 14:41:17 oss01 kernel:
Dec 17 14:41:17 oss01 kernel: LDISKFS-fs error (device dm-16): mb_free_blocks: double-free of inode 214925368's block 859725309(bit 24061 in group 26236)
Dec 17 14:41:17 oss01 kernel:
Dec 17 14:41:17 oss01 kernel: LDISKFS-fs error (device dm-16): mb_free_blocks: double-free of inode 214925368's block 859725310(bit 24062 in group 26236)
Dec 17 14:41:17 oss01 kernel:
Dec 17 14:41:17 oss01 kernel: LDISKFS-fs error (device dm-16): mb_free_blocks: double-free of inode 214925368's block 859725311(bit 24063 in group 26236)
Dec 17 14:41:17 oss01 kernel:
Dec 17 14:41:17 oss01 kernel: LustreError: 759:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) error starting handle for op 1 (120 credits): rc -30
Dec 17 15:22:13 oss01 heartbeat: [26083]: info: Checking status of STONITH device [external/ipmi ]
Dec 17 15:22:13 oss01 heartbeat: [32011]: info: Exiting STONITH-stat process 26083 returned rc 0.
Dec 17 15:35:05 oss01 kernel: LustreError: 675:0:(ldlm_resource.c: 651:ldlm_resource_add()) lvbo_init failed for resource 94: rc -2
Dec 17 15:35:05 oss01 kernel: LustreError: 726:0:(ldlm_resource.c: 651:ldlm_resource_add()) lvbo_init failed for resource 95: rc -2
Dec 17 15:35:05 oss01 kernel: LustreError: 726:0:(ldlm_resource.c: 651:ldlm_resource_add()) Skipped 1 previous similar message
Dec 17 15:35:05 oss01 kernel: LustreError: 698:0:(ldlm_resource.c: 651:ldlm_resource_add()) lvbo_init failed for resource 94: rc -2
Dec 17 15:35:05 oss01 kernel: LustreError: 698:0:(ldlm_resource.c: 651:ldlm_resource_add()) Skipped 1 previous similar message
Dec 17 15:35:05 oss01 kernel: LustreError: 739:0:(ldlm_resource.c: 651:ldlm_resource_add()) lvbo_init failed for resource 97: rc -2
Dec 17 15:35:05 oss01 kernel: LustreError: 739:0:(ldlm_resource.c: 651:ldlm_resource_add()) Skipped 4 previous similar messages
Dec 17 15:35:05 oss01 kernel: LustreError: 712:0:(ldlm_resource.c: 651:ldlm_resource_add()) lvbo_init failed for resource 95: rc -2
Dec 17 15:35:05 oss01 kernel: LustreError: 712:0:(ldlm_resource.c: 651:ldlm_resource_add()) Skipped 4 previous similar messages
Dec 17 15:35:05 oss01 kernel: LustreError: 670:0:(ldlm_resource.c: 651:ldlm_resource_add()) lvbo_init failed for resource 96: rc -2
Dec 17 15:35:05 oss01 kernel: LustreError: 670:0:(ldlm_resource.c: 651:ldlm_resource_add()) Skipped 14 previous similar messages
Dec 17 15:35:16 oss01 kernel: LustreError: 639:0:(ldlm_resource.c: 651:ldlm_resource_add()) lvbo_init failed for resource 98: rc -2
Dec 17 15:35:16 oss01 kernel: LustreError: 639:0:(ldlm_resource.c: 651:ldlm_resource_add()) Skipped 6 previous similar messages
Dec 17 15:54:53 oss01 kernel: LustreError: 777:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) error starting handle for op 8 (49 credits): rc -30
Dec 17 15:54:53 oss01 kernel: LustreError: 799:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) error starting handle for op 8 (49 credits): rc -30
Dec 17 15:54:53 oss01 kernel: LustreError: 799:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) Skipped 2 previous similar messages
Dec 17 15:54:53 oss01 kernel: LustreError: 830:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) error starting handle for op 8 (49 credits): rc -30
Dec 17 15:54:53 oss01 kernel: LustreError: 830:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) Skipped 2 previous similar messages
Dec 17 15:54:53 oss01 kernel: LustreError: 860:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) error starting handle for op 8 (49 credits): rc -30
Dec 17 15:54:53 oss01 kernel: LustreError: 860:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) Skipped 2 previous similar messages
Dec 17 15:54:54 oss01 kernel: LustreError: 809:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) error starting handle for op 8 (49 credits): rc -30
Dec 17 15:54:54 oss01 kernel: LustreError: 809:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) Skipped 2 previous similar messages
Dec 17 15:54:54 oss01 kernel: LustreError: 859:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) error starting handle for op 8 (49 credits): rc -30
Dec 17 15:54:54 oss01 kernel: LustreError: 859:0:(fsfilt-ldiskfs.c: 281:fsfilt_ldiskfs_start()) Skipped 5 previous similar messages

I solved the problem by unmounting and remounting the 3 OSTs.

Is it a bug related to 1.6.3? Or to ext4?

What is the status of 1.6.4.1?

Best Regards,

Franck
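
With this many repeated messages, it can help to tally the LDISKFS-fs errors per device before deciding which OSTs to check. The following is only a sketch: the function name summarize_ldiskfs_errors is hypothetical, and the regular expression assumes the "LDISKFS-fs error (device <dev>): <function>:" shape seen in the log above.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt in this thread:
#   ... kernel: LDISKFS-fs error (device dm-17): mb_free_blocks: double-free of ...
LDISKFS_ERROR = re.compile(
    r"LDISKFS-fs error \(device (?P<device>[^)]+)\): (?P<func>\w+):"
)


def summarize_ldiskfs_errors(lines):
    """Count LDISKFS-fs errors per (device, reporting function) pair."""
    counts = Counter()
    for line in lines:
        match = LDISKFS_ERROR.search(line)
        if match:
            counts[(match.group("device"), match.group("func"))] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the excerpt, plus one that should be ignored.
    sample = [
        "Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-15): "
        "read_block_bitmap: Invalid block bitmap - block_group = 10648, "
        "block = 348913664",
        "Dec 17 14:40:56 oss01 kernel: Remounting filesystem read-only",
        "Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-17): "
        "mb_free_blocks: double-free of inode 232644664's block "
        "930695936(bit 19200 in group 28402)",
        "Dec 17 14:41:17 oss01 kernel: LDISKFS-fs error (device dm-16): "
        "mb_free_blocks: double-free of inode 214925368's block "
        "859725308(bit 24060 in group 26236)",
    ]
    for (device, func), count in sorted(summarize_ldiskfs_errors(sample).items()):
        print(device, func, count)
```

In practice the lines would come from /var/log/messages rather than a hard-coded list. Here the tally points at three devices (dm-15, dm-16, dm-17), which matches the three OSTs reported unhealthy.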
On Tue, 2007-12-18 at 14:44 +0100, Franck Martinaux wrote:
> In /var/log/messages we have:
>
> Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-15):
> read_block_bitmap: Invalid block bitmap - block_group = 10648, block =
> 348913664

This is disk/fs corruption. Did you have any (hardware) disk errors?

> I solved the problem by unmounting and remounting the 3 OSTs.

I would suggest you unmount the affected devices and force an fsck on them with:

fsck -f <device>

> What is the status of 1.6.4.1?

It was announced as released on this list about 12 or so hours ago.

b.
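
Before unmounting anything for the forced fsck, the list of affected OSTs can be read from the health_check output quoted in the original post. A minimal sketch, assuming the output keeps the "device <name> reported unhealthy" shape shown there; the function name unhealthy_devices is hypothetical, and mapping an OST name to its block device is left to the administrator.

```python
import os

# Path taken from the original post.
HEALTH_CHECK = "/proc/fs/lustre/health_check"


def unhealthy_devices(health_text):
    """Return the device names reported unhealthy in health_check output."""
    devices = []
    for line in health_text.splitlines():
        words = line.split()
        # Assumed shape: "device <name> reported unhealthy"
        if (
            len(words) == 4
            and words[0] == "device"
            and words[2:] == ["reported", "unhealthy"]
        ):
            devices.append(words[1])
    return devices


if __name__ == "__main__":
    # Only meaningful on a Lustre server; elsewhere the file does not exist.
    if os.path.exists(HEALTH_CHECK):
        with open(HEALTH_CHECK) as handle:
            for name in unhealthy_devices(handle.read()):
                print(name)
```

Running the same check again after the fsck and remount shows whether any OST is still reported unhealthy.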
On 18 Dec 07, at 17:54, Brian J. Murrell wrote:

> On Tue, 2007-12-18 at 14:44 +0100, Franck Martinaux wrote:
>> In /var/log/messages we have:
>>
>> Dec 17 14:40:56 oss01 kernel: LDISKFS-fs error (device dm-15):
>> read_block_bitmap: Invalid block bitmap - block_group = 10648,
>> block = 348913664
>
> This is disk/fs corruption. Did you have any (hardware) disk errors?

No disk errors at all; the hardware is reporting nothing and is perfectly healthy.

>> I solved the problem by unmounting and remounting the 3 OSTs.
>
> I would suggest you unmount the affected devices and force an fsck on
> them with:
>
> fsck -f <device>

Ok

>> What is the status of 1.6.4.1?
>
> It was announced as released on this list about 12 or so hours ago.

Thanks for your feedback, much appreciated.

Franck

> b.
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
Hi all,

Thanks for the explanation.

Does it mean I need to open a bug at bugzilla.lustre.org?

Thanks for the support,

Franck
On Thu, 2007-12-20 at 14:29 +0100, Franck Martinaux wrote:
> Hi all,
>
> Thanks for the explanation
>
> Does it mean I need to open a bug at bugzilla.lustre.org?

I don't (yet) see an indication of a bug. As I said, the most likely cause is disk error(s). If you can rule that out, then you might have a bug, which you can file in bugzilla with much more context (i.e. syslog covering the event and the 24h prior).

b.
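
For the bug report, the syslog window covering the event and the 24h before it can be cut out of /var/log/messages with a short script. This is a sketch under stated assumptions: the function names are hypothetical, the timestamps follow the "Dec 17 14:40:56" format seen in this thread, the year is supplied by the caller because syslog lines do not carry one, and month names are parsed in an English/C locale.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Returns None when the line does not start with such a stamp.
    The year is an assumption passed in by the caller.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def syslog_window(lines, event_start, event_end, hours_before=24, year=2007):
    """Keep lines stamped between (event_start - hours_before) and event_end.

    Lines without a parsable timestamp are dropped in this sketch.
    """
    window_start = event_start - timedelta(hours=hours_before)
    selected = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is not None and window_start <= stamp <= event_end:
            selected.append(line)
    return selected


if __name__ == "__main__":
    # First and last error times taken from the log excerpt in this thread.
    first_error = datetime(2007, 12, 17, 14, 40, 56)
    last_error = datetime(2007, 12, 17, 15, 54, 54)
    with open("/var/log/messages") as handle:
        for line in syslog_window(handle, first_error, last_error):
            print(line, end="")
```

The sketch does not handle a window that crosses a year boundary or logs that have already been rotated; both would need extra handling.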