From reading the archives I can see this issue has been hit before, but I haven't found a resolution.

I have a 50gb partition... I have formatted it at 10gb. I have it set for 4 cluster members and am using 3 of those slots.

I fill the partition to 66% and voila... no space left on device. I have tried it with big files and with lots of small files, and both ways I hit this error at 66% usage.

I am using ubuntu-server with ocfs2-tools 1.4.2-1.

If anyone has ideas/solutions I would be most grateful... this FS is awesome :P

--
Todd Freeman, Ext 6103              Don't fear the penguins!
Programming Department              http://www.linux.org/
Andrews University                  http://www.debian.org/
http://www.andrews.edu/~freeman/
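[Editor's note: a minimal sketch of the fill test described above, for anyone wanting to reproduce it. The device /dev/sdb1, the label, and the mount point /mnt/ocfs2 are assumptions, and the o2cb cluster stack is assumed to be configured and online.]

    # format with 4 node slots, as described above, then mount
    mkfs.ocfs2 -N 4 -L ocfs2test /dev/sdb1
    mount -t ocfs2 /dev/sdb1 /mnt/ocfs2

    # write 1 GB files until a write fails, watching usage climb
    i=0
    while dd if=/dev/zero of=/mnt/ocfs2/fill.$i bs=1M count=1024 2>/dev/null; do
        i=$((i+1))
        df -h /mnt/ocfs2
    done
    df -h /mnt/ocfs2   # reported failure point is around 66% ("No space left on device")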
Which kernel are you using? We have fixed this issue in mainline. We will soon have the same fix for production kernels.

On 09/07/2010 02:06 PM, Todd Freeman wrote:
> From reading the archives I can see this issue has been hit before but
> I haven't found a resolution.
>
> I have a 50gb partition... I have formatted it at 10gb. I have it set
> for 4 cluster members and am using 3 of those slots.
>
> I fill the partition to 66% and voila... no space left on device. I
> have tried it with big files and lots of small files and both ways I hit
> this error at 66% usage.
>
> I am using ubuntu-server with ocfs2-tools 1.4.2-1
>
> If anyone has ideas/solutions I would be most grateful... this FS is
> awesome :P
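[Editor's note: the version information asked for here can be gathered with standard commands; the package name assumes a Debian/Ubuntu system.]

    uname -r                    # running kernel version
    modinfo ocfs2 | head -n 5   # ocfs2 module belonging to that kernel
    dpkg -l ocfs2-tools         # installed ocfs2-tools package version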
On 09/29/2010 05:13 PM, Alexander Barton wrote:
> Hello again!
>
> On 21.09.2010 at 11:04, Tao Ma wrote:
>
>> On 09/21/2010 04:52 PM, Alexander Barton wrote:
>>
>>> So kernel 2.6.35.4 would be ok?
>>
>> It should work.
>>
>>> And OCFS2 tools from the GIT master branch? Or a special tag? There is no archive or release, right?
>>
>> I have already committed the patches to ocfs2-tools.
>> So you can get from
>> git clone git://oss.oracle.com/git/ocfs2-tools.git
>
> We upgraded both of our cluster nodes last Friday to
>
> - Debian Linux kernel "2.6.35-trunk-amd64"
>   (linux-image-2.6.35-trunk-amd64_2.6.35-1~experimental.3_amd64.deb),
>   which is 2.6.35.4 plus Debian patches
>
> - OCFS2 tools 1.6.3 from GIT
>
> Since then, our cluster is VERY unstable, we get lots of "general protection faults" and hard lockups. "Lots" as in "often more than 2 times a day".

Sorry for the trouble.

> Our scenario is OCFS2 on top of DRBD. It looks like the "crash pattern" is the following:
>
> On Node 2:
>
> cl1-n2 kernel: [ 4006.829327] general protection fault: 0000 [#21] SMP
> cl1-n2 kernel: [ 4006.829487] last sysfs file: /sys/devices/platform/coretemp.7/temp1_label
> cl1-n2 kernel: [ 4006.829558] CPU 1
> cl1-n2 kernel: [ 4006.829611] Modules linked in: ocfs2 jbd2 quota_tree tun xt_tcpudp iptable_filter hmac sha1_generic ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iptable_nat nf_nat configfs nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext2 coretemp drbd lru_cache cn loop hed tpm_tis snd_pcm snd_timer snd soundcore psmouse snd_page_alloc processor tpm pcspkr evdev joydev tpm_bios dcdbas serio_raw i5k_amb button rng_core shpchp pci_hotplug i5000_edac edac_core ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod usbhid hid sg sr_mod cdrom ata_generic sd_mod ses crc_t10dif enclosure ata_piix ehci_hcd uhci_hcd usbcore bnx2 libata nls_base megaraid_sas scsi_mod e1000e thermal fan thermal_sys [last unloaded: scsi_wait_scan]
> cl1-n2 kernel: [ 4006.833215]
> cl1-n2 kernel: [ 4006.833215] Pid: 7699, comm: apache2 Tainted: G D 2.6.35-trunk-amd64 #1 0H603H/PowerEdge 2950
> cl1-n2 kernel: [ 4006.833215] RIP: 0010:[<ffffffff810e1886>] [<ffffffff810e1886>] __kmalloc+0xd3/0x136
> cl1-n2 kernel: [ 4006.833215] RSP: 0018:ffff88012e277cd8 EFLAGS: 00010006
> cl1-n2 kernel: [ 4006.833215] RAX: 0000000000000000 RBX: 6f635f6465727265 RCX: ffffffffa0686032
> cl1-n2 kernel: [ 4006.833215] RDX: 0000000000000000 RSI: ffff88012e277da8 RDI: 0000000000000004
> cl1-n2 kernel: [ 4006.833215] RBP: ffffffff81625520 R08: ffff880001a52510 R09: 0000000000000003
> cl1-n2 kernel: [ 4006.833215] R10: ffff88009a561b40 R11: ffff88022d62f400 R12: 000000000000000b
> cl1-n2 kernel: [ 4006.833215] R13: 0000000000008050 R14: 0000000000008050 R15: 0000000000000246
> cl1-n2 kernel: [ 4006.833215] FS: 00007f9199715740(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
> cl1-n2 kernel: [ 4006.833215] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> cl1-n2 kernel: [ 4006.833215] CR2: 00000000402de9d0 CR3: 00000001372a1000 CR4: 00000000000406e0
> cl1-n2 kernel: [ 4006.833215] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> cl1-n2 kernel: [ 4006.833215] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> cl1-n2 kernel: [ 4006.833215] Process apache2 (pid: 7699, threadinfo ffff88012e276000, task ffff88009a561b40)
> cl1-n2 kernel: [ 4006.833215] Stack:
> cl1-n2 kernel: [ 4006.833215]  ffff8801b1af9c20 ffffffffa0686032 ffff8801b1a2da20 ffff88018f5f30c0
> cl1-n2 kernel: [ 4006.833215] <0> ffff88012e277e88 000000000000000a ffff88018d105300 ffff88009a561b40
> cl1-n2 kernel: [ 4006.833215] <0> ffff88012e277da8 ffffffffa0686032 ffff88012e277e88 ffff88012e277da8
> cl1-n2 kernel: [ 4006.833215] Call Trace:
> cl1-n2 kernel: [ 4006.833215] [<ffffffffa0686032>] ? ocfs2_fast_follow_link+0x166/0x284 [ocfs2]
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f29fa>] ? do_follow_link+0xdb/0x24c
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f2d55>] ? link_path_walk+0x1ea/0x482
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f311f>] ? path_walk+0x63/0xd6
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f27ba>] ? path_init+0x46/0x1ab
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f3288>] ? do_path_lookup+0x20/0x85
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810f3cd9>] ? user_path_at+0x46/0x78
> cl1-n2 kernel: [ 4006.833215] [<ffffffff81038bac>] ? pick_next_task_fair+0xe6/0xf6
> cl1-n2 kernel: [ 4006.833215] [<ffffffff81305101>] ? schedule+0x4d4/0x530
> cl1-n2 kernel: [ 4006.833215] [<ffffffff81060526>] ? prepare_creds+0x87/0x9c
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810e8649>] ? sys_faccessat+0x96/0x15b
> cl1-n2 kernel: [ 4006.833215] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
> cl1-n2 kernel: [ 4006.833215] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
> cl1-n2 kernel: [ 4006.833215] RIP [<ffffffff810e1886>] __kmalloc+0xd3/0x136
> cl1-n2 kernel: [ 4006.833215] RSP <ffff88012e277cd8>
> cl1-n2 kernel: [ 4006.833215] ---[ end trace b1eead7c8752b710 ]---

Interesting, this seems to be unrelated to discontig bg. Please file a bug for it.

> This fault then repeats over and over again ...
> Sometimes the node seems to lock up completely; then Node 1 logs:
>
> cl1-n1 kernel: [ 7314.523473] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> cl1-n1 kernel: [ 7314.523485] block drbd0: asender terminated
> cl1-n1 kernel: [ 7314.523488] block drbd0: Terminating drbd0_asender
> cl1-n1 kernel: [ 7314.523495] block drbd0: short read receiving data: read 2960 expected 4096
> cl1-n1 kernel: [ 7314.523502] block drbd0: Creating new current UUID
> cl1-n1 kernel: [ 7314.523509] block drbd0: error receiving Data, l: 4120!
> cl1-n1 kernel: [ 7314.523705] block drbd0: Connection closed
> cl1-n1 kernel: [ 7314.523710] block drbd0: conn( NetworkFailure -> Unconnected )
> cl1-n1 kernel: [ 7314.523718] block drbd0: receiver terminated
> cl1-n1 kernel: [ 7314.523721] block drbd0: Restarting drbd0_receiver
> cl1-n1 kernel: [ 7314.523723] block drbd0: receiver (re)started
> cl1-n1 kernel: [ 7314.523730] block drbd0: conn( Unconnected -> WFConnection )
>
> ... which looks "fine" because of the 1st node not responding any more.
> But then we get:
>
> cl1-n1 kernel: [ 7319.136065] o2net: connection to node cl1-n2 (num 1) at 10.0.1.2:6999 has been idle for 10.0 seconds, shutting it down.
> cl1-n1 kernel: [ 7319.136079] (swapper,0,0):o2net_idle_timer1O:ee n ee n ee n ee n ee n ee n ee n ee ee ee ee ee ee ee ee ee ee ee ee ee sending message 506 (key 0xf5dfae8c) to node 1
>
> And endless messages like the following ...
>
> cl1-n1 kernel: [ 7319.196135] (TunCtl,3278,6):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> cl1-n1 kernel: [ 7319.196192] (IPaddr2,3416,4):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> cl1-n1 kernel: [ 7319.196256] (IPaddr2,3282,1):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> cl1-n1 kernel: [ 7319.196373] (linux,3271,3):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> cl1-n1 kernel: [ 7319.196622] (apache2,5474,0):dlm_send_remote_unlock_request:358 ERROR: Error -107 when sending message 506 (key 0xf5dfae8c) to node 1
> ...
>
> ... until node 1 hangs forever, too.
>
> After rebooting both of the nodes, the cluster runs for some time ... and errors out like above again. And again. And ... :-/
>
> Two other error messages observed while trying to mount the OCFS2 filesystem on one node (but we can't reproduce any more):
>
> cl1-n2 kernel: [ 361.012172] INFO: task mount.ocfs2:4969 blocked for more than 120 seconds.
> cl1-n2 kernel: [ 361.012252] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cl1-n2 kernel: [ 361.012343] mount.ocfs2 D 00000000ffff3314 0 4969 1 0x00000000
> cl1-n2 kernel: [ 361.012347] ffff88022c8c1b40 0000000000000082 ffff880200000000 ffff88022c133e90
> cl1-n2 kernel: [ 361.012351] ffff88022ec751c0 00000000000154c0 00000000000154c0 00000000000154c0
> cl1-n2 kernel: [ 361.012354] ffff8802184b3fd8 00000000000154c0 ffff88022c8c1b40 ffff8802184b3fd8
> cl1-n2 kernel: [ 361.012356] Call Trace:
> cl1-n2 kernel: [ 361.012365] [<ffffffff813054df>] ? schedule_timeout+0x2d/0xd7
> cl1-n2 kernel: [ 361.012369] [<ffffffff81305354>] ? wait_for_common+0xd1/0x14e
> cl1-n2 kernel: [ 361.012374] [<ffffffff8103f630>] ? default_wake_function+0x0/0xf
> cl1-n2 kernel: [ 361.012394] [<ffffffffa05bc21f>] ? __ocfs2_cluster_lock+0x6e0/0x890 [ocfs2]
> cl1-n2 kernel: [ 361.012403] [<ffffffffa0551bb4>] ? dlm_register_domain+0x9e4/0xaf0 [ocfs2_dlm]
> cl1-n2 kernel: [ 361.012408] [<ffffffff81190a3e>] ? hweight_long+0x5/0x6
> cl1-n2 kernel: [ 361.012420] [<ffffffffa05bd242>] ? T.775+0x18/0x1d [ocfs2]
> cl1-n2 kernel: [ 361.012432] [<ffffffffa05bd2f3>] ? ocfs2_super_lock+0xac/0x2bd [ocfs2]
> cl1-n2 kernel: [ 361.012443] [<ffffffffa05b4d3b>] ? ocfs2_is_hard_readonly+0x10/0x23 [ocfs2]
> cl1-n2 kernel: [ 361.012455] [<ffffffffa05bd2f3>] ? ocfs2_super_lock+0xac/0x2bd [ocfs2]
> cl1-n2 kernel: [ 361.012470] [<ffffffffa05f5d66>] ? ocfs2_fill_super+0x1227/0x2101 [ocfs2]
> cl1-n2 kernel: [ 361.012475] [<ffffffff8118e2f3>] ? snprintf+0x36/0x3b
> cl1-n2 kernel: [ 361.012478] [<ffffffff810eb86e>] ? get_sb_bdev+0x137/0x19a
> cl1-n2 kernel: [ 361.012492] [<ffffffffa05f4b3f>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
> cl1-n2 kernel: [ 361.012495] [<ffffffff810eaf45>] ? vfs_kern_mount+0xa6/0x196
> cl1-n2 kernel: [ 361.012498] [<ffffffff810eb094>] ? do_kern_mount+0x49/0xe7
> cl1-n2 kernel: [ 361.012502] [<ffffffff810ff38b>] ? do_mount+0x75c/0x7d6
> cl1-n2 kernel: [ 361.012506] [<ffffffff810d90ba>] ? alloc_pages_current+0x9f/0xc2
> cl1-n2 kernel: [ 361.012508] [<ffffffff810ff48d>] ? sys_mount+0x88/0xc3
> cl1-n2 kernel: [ 361.012513] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
>
> cl1-n2 kernel: [ 361.012518] INFO: task mount.ocfs2:5481 blocked for more than 120 seconds.
> cl1-n2 kernel: [ 361.012588] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cl1-n2 kernel: [ 361.012674] mount.ocfs2 D 00000000ffff63dd 0 5481 1 0x00000000
> cl1-n2 kernel: [ 361.012677] ffff88022c7a2210 0000000000000082 ffff88022c5ae2c8 ffffffff00000000
> cl1-n2 kernel: [ 361.012680] ffff88022ec76d00 00000000000154c0 00000000000154c0 00000000000154c0
> cl1-n2 kernel: [ 361.012683] ffff880218485fd8 00000000000154c0 ffff88022c7a2210 ffff880218485fd8
> cl1-n2 kernel: [ 361.012685] Call Trace:
> cl1-n2 kernel: [ 361.012691] [<ffffffff813063b5>] ? rwsem_down_failed_common+0x97/0xcb
> cl1-n2 kernel: [ 361.012694] [<ffffffff810ea58f>] ? test_bdev_super+0x0/0xd
> cl1-n2 kernel: [ 361.012697] [<ffffffff81306405>] ? rwsem_down_write_failed+0x1c/0x25
> cl1-n2 kernel: [ 361.012700] [<ffffffff8118f2c3>] ? call_rwsem_down_write_failed+0x13/0x20
> cl1-n2 kernel: [ 361.012703] [<ffffffff81305d2e>] ? down_write+0x25/0x27
> cl1-n2 kernel: [ 361.012705] [<ffffffff810eb361>] ? sget+0x99/0x34d
> cl1-n2 kernel: [ 361.012708] [<ffffffff810ea565>] ? set_bdev_super+0x0/0x2a
> cl1-n2 kernel: [ 361.012710] [<ffffffff810eb7d5>] ? get_sb_bdev+0x9e/0x19a
> cl1-n2 kernel: [ 361.012724] [<ffffffffa05f4b3f>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
> cl1-n2 kernel: [ 361.012727] [<ffffffff810eaf45>] ? vfs_kern_mount+0xa6/0x196
> cl1-n2 kernel: [ 361.012730] [<ffffffff810eb094>] ? do_kern_mount+0x49/0xe7
> cl1-n2 kernel: [ 361.012733] [<ffffffff810ff38b>] ? do_mount+0x75c/0x7d6
> cl1-n2 kernel: [ 361.012735] [<ffffffff810d90ba>] ? alloc_pages_current+0x9f/0xc2
> cl1-n2 kernel: [ 361.012738] [<ffffffff810ff48d>] ? sys_mount+0x88/0xc3
>
> Ok, so this kernel and/or OCFS2-tools upgrade is a very unpleasant one for us :-(
> But what can we do?
>
> Is it possible for us to downgrade the kernel and ocfs2-tools after "tunefs.ocfs2 --fs-features=discontig-bg"?

Just use tunefs.ocfs2 --fs-features=nodiscontig-bg to disable it and you can go back to the old kernel and use it.

Regards,
Tao
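[Editor's note: a rough sketch of that downgrade path. The device /dev/drbd0 and the mount point /mnt/ocfs2 are assumptions, and to the best of my knowledge tunefs.ocfs2 requires the volume to be unmounted on all nodes before it will change feature flags.]

    # on every node
    umount /mnt/ocfs2

    # on one node only: verify the volume is clean, then drop the feature
    fsck.ocfs2 -n /dev/drbd0
    tunefs.ocfs2 --fs-features=nodiscontig-bg /dev/drbd0

    # confirm discontig-bg no longer appears in the incompat feature flags
    debugfs.ocfs2 -R "stats" /dev/drbd0 | grep -i feature

After that, the old kernel and ocfs2-tools should be able to mount the volume again.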
On 29.09.2010 at 11:25, Tao Ma wrote:
> On 09/29/2010 05:13 PM, Alexander Barton wrote:
>
>> cl1-n2 kernel: [ 4006.829327] general protection fault: 0000 [#21] SMP
>> [... full oops in __kmalloc via ocfs2_fast_follow_link, quoted in full above ...]
>> cl1-n2 kernel: [ 4006.833215] ---[ end trace b1eead7c8752b710 ]---
>
> Interesting, this seems to be unrelated to discontig bg. Please file a bug for it.

Done. Bug #1292: <http://oss.oracle.com/bugzilla/show_bug.cgi?id=1292>

>> This fault then repeats over and over again ...
>> Sometimes the node seems to lock up completely; then Node 1 logs:
>>
>> [... DRBD connection loss, o2net idle-timeout shutdown, and endless
>>      dlm_send_remote_unlock_request "Error -107" messages, quoted in full above ...]
>>
>> ... until node 1 hangs forever, too.
>>
>> After rebooting both of the nodes, the cluster runs for some time ... and errors out like above again. And again. And ... :-/
>>
>> Two other error messages observed while trying to mount the OCFS2 filesystem on one node (but we can't reproduce any more):
>>
>> cl1-n2 kernel: [ 361.012172] INFO: task mount.ocfs2:4969 blocked for more than 120 seconds.
>> [... two "blocked for more than 120 seconds" traces for mount.ocfs2, quoted in full above ...]

Do you have any comments on these errors and traces? Or anybody else?

>> Is it possible for us to downgrade the kernel and ocfs2-tools after "tunefs.ocfs2 --fs-features=discontig-bg"?
>
> Just use tunefs.ocfs2 --fs-features=nodiscontig-bg to disable it and you can go back to the old kernel and use it.

Ok, we will most probably try this to get a stable system again. But then we will encounter the "no space left" problem again. Hmpf :-(

Thanks for your support!
Alex
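[Editor's note: for whoever picks the original "no space left" problem up again, a hedged starting point for checking how much space the allocators actually see at the ENOSPC point; the device /dev/drbd0 and mount point /mnt/ocfs2 are assumptions.]

    df -h /mnt/ocfs2    # block usage as reported when the write fails
    df -i /mnt/ocfs2    # inode usage, to rule out inode exhaustion

    # inspect the global bitmap system file directly (// addresses system files)
    debugfs.ocfs2 -R "stat //global_bitmap" /dev/drbd0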