Christian van Barneveld
2009-May-16 13:55 UTC
[Ocfs2-users] Filesystem corruption and OCFS2 errors
Hi,

Our OCFS2 cluster has been stable for approximately 8 months, but this week things went wrong. First we had high load problems. The cause was that a couple of directories had filled up with files, one of them with over 1.5 million files (symlinks), and NFS (the mounts are exported over NFS) caused high load because of that. Directory listings were no longer possible. I cleaned up the directories; after that the load returned to normal and everything seemed to be fine.

But within a day our customer reported that files kept disappearing. Those files were not from the directories that I had cleaned, but from random places on the filesystem. There are also files that are no longer accessible, and a read-only fsck showed some inode errors.

We have 3 OCFS2 filesystems mounted and 2 of them had problems. Last night I brought down the cluster, unmounted the filesystems and ran a filesystem check. The 2 affected filesystems reported several errors like:

[DIRENT_INODE_FREE] Directory entry 'f5377cd11ee628fe7c76c7f5b47f3bee.jpg' refers to inode number 811823124 which isn't allocated, clear the entry? <y> y
[INODE_ORPHANED] Inode 800661759 was found in the orphan directory. Delete its contents and unlink it? <y> y

I fixed the 2 filesystems that had problems and decided to also check the third filesystem, which had no problems, and that is when something went terribly wrong. The first error was like this:

[SUPERBLOCK_CLUSTERS] Superblock has clusters set to 40959872 instead of 999936 recorded in global_bitmap, it may be caused by an unsuccessful resize. Trust global_bitmap? <y>

And I think I gave the wrong answer. After that came a lot of inode errors, and when it finished there was no data anymore! Also, after a remount the filesystem is not 2.5 TB, but 500 GB. LVM is used to create a 2.5 TB filesystem from one 2 TB LUN and one 500 GB LUN:

VG Size 2.44 TB

But fdisk says:

Disk /dev/mapper/vg04-FS1: 485.3 GB, 485322915840 bytes

OCFS2:
number of blocks: 118702080
bytes per block: 4096
number of clusters: 7418880
bytes per cluster: 65536

After that I tried:

tunefs.ocfs2 -S /dev/vg04/FS1
tunefs.ocfs2 1.4.1
tunefs.ocfs2: Cannot shrink volume size from 118702080 blocks to 118487040 blocks
tunefs.ocfs2: Nothing to do. Exiting.

But that gave no result. Is there anything I can do to fix this? I have tried a lot of things, but without results.

I also tried a new kernel (2.6.29.3), but after booting and mounting it crashed (dm-17 is NOT the corrupted third filesystem, but the second one, which had no problems anymore):

May 15 23:47:31 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P??\?z
May 15 23:47:31 fileserver-1 kernel:
May 15 23:47:31 fileserver-1 kernel: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
May 15 23:47:31 fileserver-1 kernel: (14610,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:31 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:31 fileserver-1 kernel: (14613,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:31 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #659658: signature = ^Bu^S??\237\235
May 15 23:47:31 fileserver-1 kernel:
May 15 23:47:31 fileserver-1 kernel: (14606,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:31 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:32 fileserver-1 kernel: (14613,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:32 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:32 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:32 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:32 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:33 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:33 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:33 fileserver-1 kernel: (14612,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:33 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:33 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:34 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P??\?z
May 15 23:47:34 fileserver-1 kernel:
May 15 23:47:34 fileserver-1 kernel: (14613,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #659658: signature = ^Bu^S??\237\235
May 15 23:47:34 fileserver-1 kernel:
May 15 23:47:34 fileserver-1 kernel: (14610,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P??\?z
May 15 23:47:34 fileserver-1 kernel:
May 15 23:47:34 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -22
May 15 23:47:34 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:34 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:34 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:47:34 fileserver-1 kernel: attempt to access beyond end of device
May 15 23:47:34 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
May 15 23:47:34 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -5
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_orphan_del:1978 ERROR: status = -2
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_remove_inode:619 ERROR: status = -2
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_wipe_inode:753 ERROR: status = -2
May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_delete_inode:990 ERROR: status = -2
May 16 00:28:39 fileserver-1 kernel: ocfs2_dlm: Nodes in domain ("296B7CF537094A9BA5F193A426D92440"): 0
May 16 00:40:19 fileserver-1 kernel: ------------[ cut here ]------------
May 16 00:40:19 fileserver-1 kernel: kernel BUG at fs/ocfs2/inode.c:244!
May 16 00:40:19 fileserver-1 kernel: invalid opcode: 0000 [#1] SMP
May 16 00:40:19 fileserver-1 kernel: last sysfs file: /sys/fs/o2cb/interface_revision
May 16 00:40:19 fileserver-1 kernel: Modules linked in: ocfs2 jbd2 xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs dm_round_robin scsi_dh_rdac dm_multipath dm_mod scsi_dh qla2xxx
May 16 00:40:19 fileserver-1 kernel:
May 16 00:40:19 fileserver-1 kernel: Pid: 14609, comm: nfsd Not tainted (2.6.29.3-amd-mods-qla2xxx-mpath-fw-cluster-hm64 #1) Sun Fire V40z
May 16 00:40:19 fileserver-1 kernel: EIP: 0060:[<fa8c2580>] EFLAGS: 00010246 CPU: 0
May 16 00:40:19 fileserver-1 kernel: EIP is at ocfs2_populate_inode+0x550/0x560 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: EAX: 00000000 EBX: f49ae000 ECX: 00000000 EDX: fa9002aa
May 16 00:40:19 fileserver-1 kernel: ESI: e44eddfc EDI: f66f1000 EBP: f2821cb8 ESP: f2821c6c
May 16 00:40:19 fileserver-1 kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
May 16 00:40:19 fileserver-1 kernel: Process nfsd (pid: 14609, ti=f2820000 task=f6660080 task.ti=f2820000)
May 16 00:40:19 fileserver-1 kernel: Stack:
May 16 00:40:19 fileserver-1 kernel: 00000001 00000000 e44eda80 00000000 00000000 e44eddfc 00000001 f2821cac
May 16 00:40:19 fileserver-1 kernel: f2821cf4 00000001 f2821cb8 00000000 00000001 f2821cac 00000000 fa8c07f0
May 16 00:40:19 fileserver-1 kernel: f66f1000 e44eddfc 00000001 f2821d04 fa8c2b7b 00000000 f2821ce0 f3d0b0c0
May 16 00:40:19 fileserver-1 kernel: Call Trace:
May 16 00:40:19 fileserver-1 kernel: [<fa8c07f0>] ? ocfs2_validate_inode_block+0x0/0x280 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: [<fa8c2b7b>] ? ocfs2_iget+0x5eb/0x930 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: [<fa8b708a>] ? ocfs2_get_dentry+0x9a/0x1e0 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: [<c04d80d2>] ? skb_copy_datagram_iovec+0x132/0x1d0
May 16 00:40:19 fileserver-1 kernel: [<fa8b7277>] ? ocfs2_fh_to_dentry+0x47/0x60 [ocfs2]
May 16 00:40:19 fileserver-1 kernel: [<c0251cc5>] ? exportfs_decode_fh+0x35/0x1f0
May 16 00:40:19 fileserver-1 kernel: [<c02c470f>] ? security_task_setgroups+0xf/0x20
May 16 00:40:19 fileserver-1 kernel: [<c0132de6>] ? set_groups+0x16/0x1f0
May 16 00:40:19 fileserver-1 kernel: [<c057794d>] ? cache_check+0x2d/0x3e0
May 16 00:40:19 fileserver-1 kernel: [<c013305a>] ? groups_alloc+0x3a/0xc0
May 16 00:40:19 fileserver-1 kernel: [<c025babc>] ? nfsd_setuser+0x17c/0x360
May 16 00:40:19 fileserver-1 kernel: [<c0254bca>] ? nfsd_setuser_and_check_port+0x5a/0x60
May 16 00:40:19 fileserver-1 kernel: [<c02599c4>] ? exp_find+0x54/0x80
May 16 00:40:19 fileserver-1 kernel: [<c0259a26>] ? rqst_exp_find+0x36/0xd0
May 16 00:40:19 fileserver-1 kernel: [<c0254fe4>] ? fh_verify+0x414/0x650
May 16 00:40:19 fileserver-1 kernel: [<c02556f0>] ? nfsd_acceptable+0x0/0xe0
May 16 00:40:19 fileserver-1 kernel: [<c011fa3b>] ? default_wake_function+0xb/0x10
May 16 00:40:19 fileserver-1 kernel: [<c057794d>] ? cache_check+0x2d/0x3e0
May 16 00:40:19 fileserver-1 kernel: [<c025d6f9>] ? nfsd3_proc_getattr+0x69/0xe0
May 16 00:40:19 fileserver-1 kernel: [<c025fbb0>] ? nfs3svc_decode_fhandle+0x0/0x40
May 16 00:40:19 fileserver-1 kernel: [<c025fbb0>] ? nfs3svc_decode_fhandle+0x0/0x40
May 16 00:40:19 fileserver-1 kernel: [<c025208a>] ? nfsd_dispatch+0x9a/0x220
May 16 00:40:19 fileserver-1 kernel: [<c0251ff0>] ? nfsd_dispatch+0x0/0x220
May 16 00:40:19 fileserver-1 kernel: [<c057106b>] ? svc_process+0x3eb/0x6c0
May 16 00:40:19 fileserver-1 kernel: [<c0252746>] ? nfsd+0x136/0x240
May 16 00:40:19 fileserver-1 kernel: [<c011c5d8>] ? complete+0x48/0x60
May 16 00:40:19 fileserver-1 kernel: [<c0252610>] ? nfsd+0x0/0x240
May 16 00:40:19 fileserver-1 kernel: [<c0138972>] ? kthread+0x42/0x70
May 16 00:40:19 fileserver-1 kernel: [<c0138930>] ? kthread+0x0/0x70
May 16 00:40:19 fileserver-1 kernel: [<c010389b>] ? kernel_thread_helper+0x7/0x1c
May 16 00:40:19 fileserver-1 kernel: Code: 8f fa 85 d2 ba 20 dc 8f fa 0f 44 c2 89 86 9c 00 00 00 e9 39 ff ff ff 83 8e 44 01 00 00 20 e9 a1 fc ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 90 8d b4 26 00 00 00 00 55 89 e5 57 56
May 16 00:40:19 fileserver-1 kernel: EIP: [<fa8c2580>] ocfs2_populate_inode+0x550/0x560 [ocfs2] SS:ESP 0068:f2821c6c
May 16 00:40:19 fileserver-1 kernel: ---[ end trace 3b05f9cfd74396a1 ]---

Could this be a problem with NFS on OCFS2? I went back to my previous kernel, 2.6.25.5, and it seemed to be stable. At this moment I have 2 mounted (production) filesystems and 1 unmounted, corrupted filesystem.

This morning I looked in the logs and there were errors again! Many like this:

(249,1):ocfs2_orphan_del:1869 ERROR: status = -2
(249,1):ocfs2_remove_inode:610 ERROR: status = -2
(249,1):ocfs2_wipe_inode:736 ERROR: status = -2
(249,1):ocfs2_delete_inode:970 ERROR: status = -2

These came from the 2 filesystems that seemed to be clean last night.

- What can I do to prevent filesystem corruption on my 2 production OCFS2 filesystems and get rid of the above errors?
- Is it possible to fix the corrupted third filesystem?
- What is the most stable kernel (or setup) in my case? For the last year I have been using 2.6.25.5. The 2.6.29.3 kernel I tried crashed after a couple of minutes.

Versions:
OS: Debian Etch (4.0)
kernel: custom 2.6.25.5
o2cb_ctl version 1.4.1
ocfs2-tools 1.4.1
OCFS2 DLM 1.5.0
OCFS2 DLMFS 1.5.0

I hope that you can help me with these problems.

Best regards,
Christian van Barneveld
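As a rough cross-check of the sizes quoted in this message, the following Python sketch multiplies out the cluster and block counts. Every figure is copied from the fsck.ocfs2, fdisk and tunefs.ocfs2 output above; the helper and variable names are illustrative only, and the comments describe what the arithmetic shows, not a diagnosis of the cause.

```python
# Sizes implied by the numbers quoted in the message above.
# Helper and variable names are hypothetical; only the figures come from the post.

GIB = 1024 ** 3
TIB = 1024 ** 4

BYTES_PER_BLOCK = 4096      # "bytes per block" as reported for the volume
BYTES_PER_CLUSTER = 65536   # "bytes per cluster" as reported for the volume


def clusters_to_bytes(clusters: int) -> int:
    """Convert an OCFS2 cluster count to bytes."""
    return clusters * BYTES_PER_CLUSTER


def blocks_to_bytes(blocks: int) -> int:
    """Convert an OCFS2 block count to bytes."""
    return blocks * BYTES_PER_BLOCK


# Values from the [SUPERBLOCK_CLUSTERS] prompt.
superblock_bytes = clusters_to_bytes(40959872)
global_bitmap_bytes = clusters_to_bytes(999936)

# Values reported after the fsck run.
fs_bytes_from_clusters = clusters_to_bytes(7418880)
fs_bytes_from_blocks = blocks_to_bytes(118702080)

# Device size as seen by tunefs.ocfs2 (in blocks) and by fdisk (in bytes).
device_bytes_tunefs = blocks_to_bytes(118487040)
device_bytes_fdisk = 485322915840

# How far the filesystem extends past the end of the device.
overshoot_bytes = fs_bytes_from_blocks - device_bytes_tunefs

if __name__ == "__main__":
    print(f"superblock clusters : {superblock_bytes / TIB:.2f} TiB")
    print(f"global_bitmap       : {global_bitmap_bytes / GIB:.2f} GiB")
    print(f"filesystem now      : {fs_bytes_from_blocks / GIB:.2f} GiB")
    print(f"device (fdisk)      : {device_bytes_fdisk / GIB:.2f} GiB")
    print(f"overshoot           : {overshoot_bytes // BYTES_PER_BLOCK} blocks")
```

On these figures the old superblock value corresponds to roughly 2.44 TiB, which matches the reported VG size, while the filesystem as it stands after fsck is 215040 blocks (840 MiB) larger than the device that tunefs.ocfs2 and fdisk see. That difference is the "shrink" tunefs.ocfs2 refuses to perform.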
Christian van Barneveld
2009-May-20 06:26 UTC
[Ocfs2-users] Filesystem corruption and OCFS2 errors
Hi Sunil, Joel,

Any clue about this? I still have 1 corrupted filesystem (shrunk from 2.5 TB to 500 GB after fsck) and there are still errors on the 2 other mounts:

(249,1):ocfs2_orphan_del:1869 ERROR: status = -2
(249,1):ocfs2_remove_inode:610 ERROR: status = -2
(249,1):ocfs2_wipe_inode:736 ERROR: status = -2
(249,1):ocfs2_delete_inode:970 ERROR: status = -2

Can NFS be a problem? And what about the kernel panic (2.6.29)? I hope you can help me out with this.

Regards,
Christian
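The status values and the want/limit pairs in the kernel messages quoted in this thread can be decoded mechanically. Below is a minimal Python sketch, assuming the status numbers are negated Linux errno values and that want and limit are counted in 512-byte sectors. The function names and regular expressions are made up for illustration and are not part of any OCFS2 tool.

```python
import errno
import re
from typing import Optional, Tuple

# Assumption: the block layer reports want/limit in 512-byte sectors.
SECTOR_BYTES = 512

STATUS_RE = re.compile(r"ERROR: status = -(\d+)")
RANGE_RE = re.compile(r"want=(\d+), limit=(\d+)")


def decode_status(line: str) -> Optional[str]:
    """Return the errno name for an 'ERROR: status = -N' line, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    # Assumption: the logged status is a negated errno number.
    return errno.errorcode.get(int(match.group(1)))


def decode_range(line: str) -> Optional[Tuple[int, int]]:
    """Return (want_bytes, limit_bytes) for a 'want=..., limit=...' line."""
    match = RANGE_RE.search(line)
    if match is None:
        return None
    want_sectors, limit_sectors = (int(g) for g in match.groups())
    return want_sectors * SECTOR_BYTES, limit_sectors * SECTOR_BYTES


if __name__ == "__main__":
    samples = [
        "(249,1):ocfs2_orphan_del:1869 ERROR: status = -2",
        "(14613,0):ocfs2_read_locked_inode:466 ERROR: status = -5",
        "(14610,1):ocfs2_read_locked_inode:466 ERROR: status = -22",
        "dm-17: rw=0, want=6635799728, limit=419422208",
    ]
    for line in samples:
        print(line)
        print("   status:", decode_status(line), " range:", decode_range(line))
```

Under those assumptions, status -2 is ENOENT, -5 is EIO and -22 is EINVAL, and the dm-17 line describes a read ending about 3.1 TiB into a device of about 200 GiB.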
Christian van Barneveld wrote:
> Our OCFS2 cluster has been stable for approximately 8 months, but this week things went wrong. First we had high load problems. The cause was that a couple of directories had filled up with files, one of them with over 1.5 million files (symlinks), and NFS (the mounts are exported over NFS) caused high load because of that. Directory listings were no longer possible.
> I cleaned up the directories; after that the load returned to normal and everything seemed to be fine.
>
> But within a day our customer reported that files kept disappearing. Those files were not from the directories that I had cleaned, but from random places on the filesystem. There are also files that are no longer accessible, and a read-only fsck showed some inode errors.
> We have 3 OCFS2 filesystems mounted and 2 of them had problems. Last night I brought down the cluster, unmounted the filesystems and ran a filesystem check. The 2 affected filesystems reported several errors like:
> [DIRENT_INODE_FREE] Directory entry 'f5377cd11ee628fe7c76c7f5b47f3bee.jpg' refers to inode number 811823124 which isn't allocated, clear the entry? <y> y
> [INODE_ORPHANED] Inode 800661759 was found in the orphan directory. Delete its contents and unlink it? <y> y

When one issues an rm, we first remove the directory entry and add a corresponding entry in the orphan dir. Then we delete: we free all the extents, the inode bit, etc. In your case, the dir entry is there but the inode bit is free. Also, that inode is still in the orphan dir. Did you by chance save the fsck output? I wanted to see whether all the inodes were co-located, meaning the error stemmed from a corrupt inode alloc bitmap.

> I fixed the 2 filesystems that had problems and decided to also check the third filesystem, which had no problems, and that is when something went terribly wrong.
> The first error was like this:
> [SUPERBLOCK_CLUSTERS] Superblock has clusters set to 40959872 instead of 999936 recorded in global_bitmap, it may be caused by an unsuccessful resize. Trust global_bitmap? <y>
> And I think I gave the wrong answer. After that came a lot of inode errors, and when it finished there was no data anymore!

No, you did not give the wrong answer. More specifically, your answer did not cause the problem. That yes only set the size in the superblock to the same value as in the global bitmap. That's harmless. The question is why the value in the global bitmap was so wrong. This is one value we don't touch other than during a resize. And we don't allow shrinking. So the size in the global bitmap should never be smaller than the one in the superblock.

> Also, after a remount the filesystem is not 2.5 TB, but 500 GB. LVM is used to create a 2.5 TB filesystem from one 2 TB LUN and one 500 GB LUN:
> VG Size 2.44 TB
> But fdisk says:
> Disk /dev/mapper/vg04-FS1: 485.3 GB, 485322915840 bytes
>
> OCFS2:
> number of blocks: 118702080
> bytes per block: 4096
> number of clusters: 7418880
> bytes per cluster: 65536

If fdisk is saying the device is 485G, then that's what the other tools will see. And this appears to be the root cause of your problem: LVM. There is a reason why we don't support LVM.

> After that I tried:
> tunefs.ocfs2 -S /dev/vg04/FS1
> tunefs.ocfs2 1.4.1
> tunefs.ocfs2: Cannot shrink volume size from 118702080 blocks to 118487040 blocks
> tunefs.ocfs2: Nothing to do. Exiting.
> But that gave no result.
>
> Is there anything I can do to fix this? I have tried a lot of things, but without results.

The best solution is to salvage your data using debugfs.ocfs2. It has commands like dump/rdump that read the files directly off the disk.

> I also tried a new kernel (2.6.29.3), but after booting and mounting it crashed (dm-17 is NOT the corrupted third filesystem, but the second one, which had no problems anymore):
>
> May 15 23:47:31 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P??\?z
> May 15 23:47:31 fileserver-1 kernel:
> May 15 23:47:31 fileserver-1 kernel: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
> May 15 23:47:31 fileserver-1 kernel: (14610,1):ocfs2_read_locked_inode:466 ERROR: status = -22
> May 15 23:47:31 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208

What this is saying is that the disk size has shrunk. It is trying to read about 3.1 TiB into the volume (want=6635799728, counted in 512-byte sectors), but the block layer is saying that the device is only about 200 GiB (limit=419422208 sectors). You have to look into your block device setup.

> May 15 23:47:31 fileserver-1 kernel: (14613,0):ocfs2_read_locked_inode:466 ERROR: status = -5
> May 15 23:47:31 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #659658: signature = ^Bu^S??\237\235
> May 15 23:47:31 fileserver-1 kernel:
> May 15 23:47:31 fileserver-1 kernel: (14606,1):ocfs2_read_locked_inode:466 ERROR: status = -22
> May 15 23:47:31 fileserver-1 kernel: attempt to access beyond end of device
> May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
> May 15 23:47:32 fileserver-1 kernel: (14613,0):ocfs2_read_locked_inode:466 ERROR: status = -5
> May 15 23:47:32 fileserver-1 kernel: attempt to access beyond end of device
> May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
> May 15 23:47:32 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -5
> May 15 23:47:32 fileserver-1 kernel: attempt to access beyond end of device
> May 15 23:47:32 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
> May 15 23:47:32 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
> May 15 23:47:33 fileserver-1 kernel: attempt to access beyond end of device
> May 15 23:47:33 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
> May 15 23:47:33 fileserver-1 kernel: (14612,0):ocfs2_read_locked_inode:466 ERROR: status = -5
> May 15 23:47:33 fileserver-1 kernel: attempt to access beyond end of device
> May 15 23:47:33 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
> May 15 23:47:34 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
> May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P??\?z
> May 15 23:47:34 fileserver-1 kernel:
> May 15 23:47:34 fileserver-1 kernel: (14613,1):ocfs2_read_locked_inode:466 ERROR: status = -22
> May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #659658: signature = ^Bu^S??\237\235
> May 15 23:47:34 fileserver-1 kernel:
> May 15 23:47:34 fileserver-1 kernel: (14610,1):ocfs2_read_locked_inode:466 ERROR: status = -22
> May 15 23:47:34 fileserver-1 kernel: OCFS2: ERROR (device dm-17): ocfs2_validate_inode_block: Invalid dinode #664384: signature = u^P??\?z
> May 15 23:47:34 fileserver-1 kernel:
> May 15 23:47:34 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -22
> May 15 23:47:34 fileserver-1 kernel: attempt to access beyond end of device
> May 15 23:47:34 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
> May 15 23:47:34 fileserver-1 kernel: (14611,0):ocfs2_read_locked_inode:466 ERROR: status = -5
> May 15 23:47:34 fileserver-1 kernel: attempt to access beyond end of device
> May 15 23:47:34 fileserver-1 kernel: dm-17: rw=0, want=6635799728, limit=419422208
> May 15 23:47:34 fileserver-1 kernel: (14612,1):ocfs2_read_locked_inode:466 ERROR: status = -5
> May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_orphan_del:1978 ERROR: status = -2
> May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_remove_inode:619 ERROR: status = -2
> May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_wipe_inode:753 ERROR: status = -2
> May 15 23:59:30 fileserver-1 kernel: (10887,0):ocfs2_delete_inode:990 ERROR: status = -2
> May 16 00:28:39 fileserver-1 kernel: ocfs2_dlm: Nodes in domain ("296B7CF537094A9BA5F193A426D92440"): 0
>
> May 16 00:40:19 fileserver-1 kernel: ------------[ cut here ]------------
> May 16 00:40:19 fileserver-1 kernel: kernel BUG at fs/ocfs2/inode.c:244!
> May 16 00:40:19 fileserver-1 kernel: invalid opcode: 0000 [#1] SMP
> May 16 00:40:19 fileserver-1 kernel: last sysfs file: /sys/fs/o2cb/interface_revision
> May 16 00:40:19 fileserver-1 kernel: Modules linked in: ocfs2 jbd2 xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs dm_round_robin scsi_dh_rdac dm_multipath dm_mod scsi_dh qla2xxx
> May 16 00:40:19 fileserver-1 kernel:
> May 16 00:40:19 fileserver-1 kernel: Pid: 14609, comm: nfsd Not tainted (2.6.29.3-amd-mods-qla2xxx-mpath-fw-cluster-hm64 #1) Sun Fire V40z
> May 16 00:40:19 fileserver-1 kernel: EIP: 0060:[<fa8c2580>] EFLAGS: 00010246 CPU: 0
> May 16 00:40:19 fileserver-1 kernel: EIP is at ocfs2_populate_inode+0x550/0x560 [ocfs2]
> May 16 00:40:19 fileserver-1 kernel: EAX: 00000000 EBX: f49ae000 ECX: 00000000 EDX: fa9002aa
> May 16 00:40:19 fileserver-1 kernel: ESI: e44eddfc EDI: f66f1000 EBP: f2821cb8 ESP: f2821c6c
> May 16 00:40:19 fileserver-1 kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> May 16 00:40:19 fileserver-1 kernel: Process nfsd (pid: 14609, ti=f2820000 task=f6660080 task.ti=f2820000)
> May 16 00:40:19 fileserver-1 kernel: Stack:
> May 16 00:40:19 fileserver-1 kernel: 00000001 00000000 e44eda80 00000000 00000000 e44eddfc 00000001 f2821cac
> May 16 
00:40:19 fileserver-1 kernel: f2821cf4 00000001 f2821cb8 00000000 00000001 f2821cac 00000000 fa8c07f0 > May 16 00:40:19 fileserver-1 kernel: f66f1000 e44eddfc 00000001 f2821d04 fa8c2b7b 00000000 f2821ce0 f3d0b0c0 > May 16 00:40:19 fileserver-1 kernel: Call Trace: > May 16 00:40:19 fileserver-1 kernel: [<fa8c07f0>] ? ocfs2_validate_inode_block+0x0/0x280 [ocfs2] > May 16 00:40:19 fileserver-1 kernel: [<fa8c2b7b>] ? ocfs2_iget+0x5eb/0x930 [ocfs2] > May 16 00:40:19 fileserver-1 kernel: [<fa8b708a>] ? ocfs2_get_dentry+0x9a/0x1e0 [ocfs2] > May 16 00:40:19 fileserver-1 kernel: [<c04d80d2>] ? skb_copy_datagram_iovec+0x132/0x1d0 > May 16 00:40:19 fileserver-1 kernel: [<fa8b7277>] ? ocfs2_fh_to_dentry+0x47/0x60 [ocfs2] > May 16 00:40:19 fileserver-1 kernel: [<c0251cc5>] ? exportfs_decode_fh+0x35/0x1f0 > May 16 00:40:19 fileserver-1 kernel: [<c02c470f>] ? security_task_setgroups+0xf/0x20 > May 16 00:40:19 fileserver-1 kernel: [<c0132de6>] ? set_groups+0x16/0x1f0 > May 16 00:40:19 fileserver-1 kernel: [<c057794d>] ? cache_check+0x2d/0x3e0 > May 16 00:40:19 fileserver-1 kernel: [<c013305a>] ? groups_alloc+0x3a/0xc0 > May 16 00:40:19 fileserver-1 kernel: [<c025babc>] ? nfsd_setuser+0x17c/0x360 > May 16 00:40:19 fileserver-1 kernel: [<c0254bca>] ? nfsd_setuser_and_check_port+0x5a/0x60 > May 16 00:40:19 fileserver-1 kernel: [<c02599c4>] ? exp_find+0x54/0x80 > May 16 00:40:19 fileserver-1 kernel: [<c0259a26>] ? rqst_exp_find+0x36/0xd0 > May 16 00:40:19 fileserver-1 kernel: [<c0254fe4>] ? fh_verify+0x414/0x650 > May 16 00:40:19 fileserver-1 kernel: [<c02556f0>] ? nfsd_acceptable+0x0/0xe0 > May 16 00:40:19 fileserver-1 kernel: [<c011fa3b>] ? default_wake_function+0xb/0x10 > May 16 00:40:19 fileserver-1 kernel: [<c057794d>] ? cache_check+0x2d/0x3e0 > May 16 00:40:19 fileserver-1 kernel: [<c025d6f9>] ? nfsd3_proc_getattr+0x69/0xe0 > May 16 00:40:19 fileserver-1 kernel: [<c025fbb0>] ? nfs3svc_decode_fhandle+0x0/0x40 > May 16 00:40:19 fileserver-1 kernel: [<c025fbb0>] ? 
nfs3svc_decode_fhandle+0x0/0x40 > May 16 00:40:19 fileserver-1 kernel: [<c025208a>] ? nfsd_dispatch+0x9a/0x220 > May 16 00:40:19 fileserver-1 kernel: [<c0251ff0>] ? nfsd_dispatch+0x0/0x220 > May 16 00:40:19 fileserver-1 kernel: [<c057106b>] ? svc_process+0x3eb/0x6c0 > May 16 00:40:19 fileserver-1 kernel: [<c0252746>] ? nfsd+0x136/0x240 > May 16 00:40:19 fileserver-1 kernel: [<c011c5d8>] ? complete+0x48/0x60 > May 16 00:40:19 fileserver-1 kernel: [<c0252610>] ? nfsd+0x0/0x240 > May 16 00:40:19 fileserver-1 kernel: [<c0138972>] ? kthread+0x42/0x70 > May 16 00:40:19 fileserver-1 kernel: [<c0138930>] ? kthread+0x0/0x70 > May 16 00:40:19 fileserver-1 kernel: [<c010389b>] ? kernel_thread_helper+0x7/0x1c > May 16 00:40:19 fileserver-1 kernel: Code: 8f fa 85 d2 ba 20 dc 8f fa 0f 44 c2 89 86 9c 00 00 00 e9 39 ff ff ff 83 8e 44 01 00 00 20 e9 a1 fc ff ff 0f 0b eb fe 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 90 8d b4 26 00 00 00 00 55 89 e5 57 56 > May 16 00:40:19 fileserver-1 kernel: EIP: [<fa8c2580>] ocfs2_populate_inode+0x550/0x560 [ocfs2] SS:ESP 0068:f2821c6c > May 16 00:40:19 fileserver-1 kernel: ---[ end trace 3b05f9cfd74396a1 ]--- > > NFS with OCFS2 problems? > I went back to my previous kernel 2.6.25.5 and it seemed to be stable. At this moment I have 2 mounted (production) filesystems and 1 umounted corrupted filesystem. This morning I looked in the logs and again errors! > Many like this: > (249,1):ocfs2_orphan_del:1869 ERROR: status = -2 > (249,1):ocfs2_remove_inode:610 ERROR: status = -2 > (249,1):ocfs2_wipe_inode:736 ERROR: status = -2 > (249,1):ocfs2_delete_inode:970 ERROR: status = -2Finally, a harmless error. ENOENT. It probably is NFS related. I don't have the said kernel handy right now to confirm.> This came from the 2 filesystems that seemed to be clean last night. > > - What can I do to prevent filesystem corruption on my 2 production OCFS2 filesystems and get rid of the above errors? > - Is it possible to fix the corrupted thirth filesystem? 
> - What is the most stable kernel (or setup) in my case? Now (and the last year) I am using 2.6.25.5. The 2.6.29.3 kernel I've tried crashed after a couple of minutes.
>
> Versions:
> OS: Debian Etch (4.0)
> kernel: custom 2.6.25.5

I would avoid LVM. We are working towards supporting CLVM, but we don't support it today. Using CLVM would be the better option. If you absolutely have to use LVM, use the SLES 11 HA Extension; it will have proper ocfs2/clvm support.

I am not sure why 2.6.29.3 crashed and 2.6.25 worked. The error reported by 2.6.29.3 should have shown up with 2.6.25 too. Just for the record, we did run the full fs regression with the stock 2.6.29 kernel.
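
As a side note on the "attempt to access beyond end of device" messages quoted above: the want and limit fields can be turned into byte sizes to compare against what LVM reports for the volume. The snippet below is only a sketch. It assumes the block layer prints both values in 512-byte sectors, and the helper names (parse_beyond_end, sectors_to_bytes) are made up for illustration, not part of any OCFS2 or LVM tool.

```python
# Assumption: want/limit in the block-layer message are 512-byte sector counts.
SECTOR_BYTES = 512


def sectors_to_bytes(sectors):
    """Convert a sector count taken from the log line into bytes."""
    return sectors * SECTOR_BYTES


def parse_beyond_end(line):
    """Extract (want, limit) from a 'dm-17: rw=0, want=..., limit=...' line."""
    # Drop the device name before the first colon, then split "key=value" pairs.
    pairs = line.split(":", 1)[1].split(",")
    fields = dict(pair.strip().split("=", 1) for pair in pairs)
    return int(fields["want"]), int(fields["limit"])


line = "dm-17: rw=0, want=6635799728, limit=419422208"
want, limit = parse_beyond_end(line)

# Decimal units, to match how fdisk prints sizes.
print("want  = %.2f TB" % (sectors_to_bytes(want) / 1e12))
print("limit = %.2f GB" % (sectors_to_bytes(limit) / 1e9))
```

Under that assumption the requested offset works out to about 3.4 TB, while the device itself is only about 215 GB, which is the mismatch to chase down in the block device setup.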