Hi list, One of our OCFS2 servers crashed with this message: Aug 26 11:33:11 soap01 kernel: Assertion failure in journal_dirty_metadata() at fs/jbd/transaction.c:1114: "handle->h_buffer_credits > 0" Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] --------- [please bite here ] --------- Aug 26 11:33:11 soap01 kernel: Kernel BUG at fs/jbd/transaction.c:1114 Aug 26 11:33:11 soap01 kernel: invalid opcode: 0000 [1] SMP Aug 26 11:33:11 soap01 kernel: last sysfs file: /devices/pci0000:00/0000:00:00.0/irq Aug 26 11:33:11 soap01 kernel: CPU 0 Aug 26 11:33:11 soap01 kernel: Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6 button battery ac netconsole xt_comment xt_tcpudp xt_state iptable_filter iptable_mangle iptable_nat ip_nat ip_ conntrack nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid usb_storage hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom pci_hotplug ehci_hcd scsi_transport_spi usbcore bnx2 reiserfs ata_piix ahci libata dm_snapshot qla2xxx firmware_class qla2xxx_conf intermodule edd d m_mod fan thermal processor sg megaraid_sas piix sd_mod scsi_mod ide_disk ide_core Aug 26 11:33:11 soap01 kernel: Pid: 4874, comm: nfsd Tainted: G U 2.6.16.60-0.21-smp #1 Aug 26 11:33:11 soap01 kernel: RIP: 0010:[<ffffffff885e21e0>] <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} Aug 26 11:33:11 soap01 kernel: RSP: 0018:ffff81021e9f1c18 EFLAGS: 00010292 Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX: ffff8101decf30c0 RCX: 0000000000000292 Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI: 0000000000000296 RDI: ffffffff80359960 Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08: ffffffff80359968 R09: ffff810221d3ad80 Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11: 0000000000000070 R12: ffff8101dda21588 Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14: ffff8102277ab400 R15: ffff8100a4dd394c Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000) GS:ffffffff803d3000(0000) knlGS:0000000000000000 Aug 26 11:33:11 soap01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3: 000000015e180000 CR4: 00000000000006e0 Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo ffff81021e9f0000, task ffff81021f92e860) Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870 ffff8101dda21588 0000000000000000 0000000000000003 Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f 0000000000000000 ffff81018ba52040 Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a Aug 26 11:33:11 soap01 kernel: Call Trace: <ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106} Aug 26 11:33:11 soap01 kernel: <ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745} <ffffffff88628766>{:ocfs2:ocfs2_mknod+1710} Aug 26 11:33:11 soap01 kernel: <ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127} <ffffffff80192b48>{vfs_mkdir+346} Aug 26 11:33:11 soap01 kernel: <ffffffff88522f05>{:nfsd:nfsd_create+753} <ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217} Aug 26 11:33:11 soap01 kernel: <ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216} <ffffffff884d549a>{:sunrpc:svc_process+982} Aug 26 11:33:11 soap01 kernel: <ffffffff802ea247>{__down_read+21} <ffffffff8851e46e>{:nfsd:nfsd+0} Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463} <ffffffff8010bed2>{child_rip+8} Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0} <ffffffff8851e46e>{:nfsd:nfsd+0} Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0} Aug 26 11:33:11 soap01 kernel: Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a 5e 88 c2 5a 04 41 ff 4c 24 08 49 39 5d 28 75 Aug 26 11:33:11 soap01 kernel: RIP <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP <ffff81021e9f1c18> Operating system: SuSE SLES 10SP1 Kernel: 2.6.16.60-0.21-smp OCFS2: 1.4.0-SLES Environment: * 2 FreeBSD 7.1-RELEASE-p2 NFS Clients * 2 SLES 10SP1 exporting the filesystem The FreeBSD clients are our email servers, so the main traffic is many small email files. NFS mounted with protocol version 3, readdirplus disabled, read and write buffer of 32k. Pre-crash symptoms: * The ocfs filesystem hung for a while or gets very slow; * Low or null device traffic on both nodes (checked with `iostat`); * The server load get 5 to 6 points higher; * It seems that something in kernel deadlock, as other processes (doing IO, but in other mount points with raiserfs) hug a CPU with 100% usage; Eg: There is a mysql database in raiserfs mount point and the mysqld hug the CPU when I call `rcmysql stop`; * Calling `reboot` or `shutdown -r now` blocks the console (didn't tried to run it with strace to get the locking point, but will try it happen again); * imapd on clients blocked in nfs requests; One of the processes was blocked in (FreeBSD kernel) state bo_wwa. Looking in some discussion group's over the net, this state means blocked by stale NFS server. Attaching to the process with `gdb`, its always blocked in close() libc call; imapd process backtrace: #0 0x282a5da3 in close () from /lib/libc.so.7 #1 0x282a5711 in memcpy () from /lib/libc.so.7 #2 0xbfbf9378 in ?? () #3 0x2828d58d in fclose () from /lib/libc.so.7 Could it be related to o2cb configuration? Current configuration: O2CB_HEARTBEAT_THRESHOLD=61 O2CB_IDLE_TIMEOUT_MS=60000 The heartbeat network is a GBit ethernet. Regards, -- .:''''':. .:' ` S?rgio Surkamp | Gerente de Rede :: ........ sergio at gruposinternet.com.br `:. .:' `:, ,.:' *Grupos Internet S.A.* `: :' R. Lauro Linhares, 2123 Torre B - Sala 201 : : Trindade - Florian?polis - SC :.' :: +55 48 3234-4109 : ' http://www.gruposinternet.com.br
Ok I think I have found something useful about the problem. I keep my eye on the servers all the day doing some tests and checking the filesystem activity with `iostat`. It seems that "something" is locking the filesystem for the entire cluster for about 3 minutes and few seconds, when many concurrent writes occur on both nodes (dunno if its the right behaviour, but I don't think so). On the last filesystem hang I have done a "debugfs.ocfs2> fs_locks" scan and got a busy lock on one node that was hold for the 3 minutes. Info about the fs_lock: Lockres: M000000000000000000000da9d5b7e0 Mode: No Lock Flags: Initialized Attached Busy RO Holders: 0 EX Holders: 0 Pending Action: Convert Pending Unlock Action: None Requested Mode: Protected Read Blocking Mode: No Lock debugfs: locate <M000000000000000000000da9d5b7e0> 13 //global_bitmap Why the node locks the global_bitmap? It is for committing the journal? Another strange thing is that the node holding the lock stopped all activity on filesystem until it's release and the other node keeps reading something from it. Both wrote pending requests when the lock was freed. I'm working on a test case for this little hung. And soon I'll post it here if anyone want it. Regards, S?rgio Em Wed, 26 Aug 2009 15:10:56 -0300 S?rgio Surkamp <sergio at gruposinternet.com.br> escreveu:> Hi list, > > One of our OCFS2 servers crashed with this message: > > Aug 26 11:33:11 soap01 kernel: Assertion failure in > journal_dirty_metadata() at fs/jbd/transaction.c:1114: > "handle->h_buffer_credits > 0" > Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] --------- > [please bite here ] --------- > Aug 26 11:33:11 soap01 kernel: Kernel BUG at fs/jbd/transaction.c:1114 > Aug 26 11:33:11 soap01 kernel: invalid opcode: 0000 [1] SMP > Aug 26 11:33:11 soap01 kernel: last sysfs > file: /devices/pci0000:00/0000:00:00.0/irq > Aug 26 11:33:11 soap01 kernel: CPU 0 Aug 26 11:33:11 soap01 kernel: > Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm > ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6 > button battery ac netconsole xt_comment xt_tcpudp xt_state > iptable_filter iptable_mangle iptable_nat ip_nat ip_ conntrack > nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid > usb_storage hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom > pci_hotplug ehci_hcd scsi_transport_spi usbcore bnx2 reiserfs > ata_piix ahci libata dm_snapshot qla2xxx firmware_class qla2xxx_conf > intermodule edd d m_mod fan thermal processor sg megaraid_sas piix > sd_mod scsi_mod ide_disk ide_core Aug 26 11:33:11 soap01 kernel: Pid: > 4874, comm: nfsd Tainted: G U 2.6.16.60-0.21-smp #1 > Aug 26 11:33:11 soap01 kernel: RIP: 0010:[<ffffffff885e21e0>] > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} > Aug 26 11:33:11 soap01 kernel: RSP: 0018:ffff81021e9f1c18 EFLAGS: > 00010292 > Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX: > ffff8101decf30c0 RCX: 0000000000000292 > Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI: > 0000000000000296 RDI: ffffffff80359960 > Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08: > ffffffff80359968 R09: ffff810221d3ad80 > Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11: > 0000000000000070 R12: ffff8101dda21588 > Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14: > ffff8102277ab400 R15: ffff8100a4dd394c > Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000) > GS:ffffffff803d3000(0000) knlGS:0000000000000000 > Aug 26 11:33:11 soap01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > 000000008005003b > Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3: > 000000015e180000 CR4: 00000000000006e0 > Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo > ffff81021e9f0000, task ffff81021f92e860) > Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870 > ffff8101dda21588 0000000000000000 0000000000000003 > Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f > 0000000000000000 ffff81018ba52040 > Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a > Aug 26 11:33:11 soap01 kernel: Call Trace: > <ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106} > Aug 26 11:33:11 soap01 kernel: > <ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745} > <ffffffff88628766>{:ocfs2:ocfs2_mknod+1710} > Aug 26 11:33:11 soap01 kernel: > <ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127} > <ffffffff80192b48>{vfs_mkdir+346} > Aug 26 11:33:11 soap01 kernel: > <ffffffff88522f05>{:nfsd:nfsd_create+753} > <ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217} Aug 26 11:33:11 soap01 > kernel: <ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216} > <ffffffff884d549a>{:sunrpc:svc_process+982} > Aug 26 11:33:11 soap01 kernel: > <ffffffff802ea247>{__down_read+21} <ffffffff8851e46e>{:nfsd:nfsd+0} > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463} > <ffffffff8010bed2>{child_rip+8} > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0} > <ffffffff8851e46e>{:nfsd:nfsd+0} > Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0} > Aug 26 11:33:11 soap01 kernel: > Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a 5e 88 c2 5a 04 41 > ff 4c 24 08 49 39 5d 28 75 > Aug 26 11:33:11 soap01 kernel: RIP > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP > <ffff81021e9f1c18> > > Operating system: SuSE SLES 10SP1 > Kernel: 2.6.16.60-0.21-smp > OCFS2: 1.4.0-SLES > > Environment: > > * 2 FreeBSD 7.1-RELEASE-p2 NFS Clients > * 2 SLES 10SP1 exporting the filesystem > > The FreeBSD clients are our email servers, so the main traffic is many > small email files. > > NFS mounted with protocol version 3, readdirplus disabled, read and > write buffer of 32k. > > Pre-crash symptoms: > * The ocfs filesystem hung for a while or gets very slow; > * Low or null device traffic on both nodes (checked with `iostat`); > * The server load get 5 to 6 points higher; > * It seems that something in kernel deadlock, as other processes > (doing IO, but in other mount points with raiserfs) hug a CPU with > 100% usage; > Eg: There is a mysql database in raiserfs mount point and the mysqld > hug the CPU when I call `rcmysql stop`; > * Calling `reboot` or `shutdown -r now` blocks the console (didn't > tried to run it with strace to get the locking point, but will try it > happen again); > * imapd on clients blocked in nfs requests; > One of the processes was blocked in (FreeBSD kernel) state bo_wwa. > Looking in some discussion group's over the net, this state means > blocked by stale NFS server. Attaching to the process with `gdb`, > its always blocked in close() libc call; > > imapd process backtrace: > #0 0x282a5da3 in close () from /lib/libc.so.7 > #1 0x282a5711 in memcpy () from /lib/libc.so.7 > #2 0xbfbf9378 in ?? () > #3 0x2828d58d in fclose () from /lib/libc.so.7 > > Could it be related to o2cb configuration? Current configuration: > > O2CB_HEARTBEAT_THRESHOLD=61 > O2CB_IDLE_TIMEOUT_MS=60000 > > The heartbeat network is a GBit ethernet. > > Regards,-- .:''''':. .:' ` S?rgio Surkamp | Gerente de Rede :: ........ sergio at gruposinternet.com.br `:. .:' `:, ,.:' *Grupos Internet S.A.* `: :' R. Lauro Linhares, 2123 Torre B - Sala 201 : : Trindade - Florian?polis - SC :.' :: +55 48 3234-4109 : ' http://www.gruposinternet.com.br
For issues on sles, please file a bug/sr with novell. The issue here is insufficient journal credits. It _could_ be that this version is missing mainline git commit e051fda4fd14fe878e6d2183b3a4640febe9e9a8. But I don't know. Novell Support will be better placed to track down the issue. S?rgio Surkamp wrote:> Hi list, > > One of our OCFS2 servers crashed with this message: > > Aug 26 11:33:11 soap01 kernel: Assertion failure in > journal_dirty_metadata() at fs/jbd/transaction.c:1114: > "handle->h_buffer_credits > 0" > Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] --------- > [please bite here ] --------- > Aug 26 11:33:11 soap01 kernel: Kernel BUG at fs/jbd/transaction.c:1114 > Aug 26 11:33:11 soap01 kernel: invalid opcode: 0000 [1] SMP > Aug 26 11:33:11 soap01 kernel: last sysfs > file: /devices/pci0000:00/0000:00:00.0/irq > Aug 26 11:33:11 soap01 kernel: CPU 0 Aug 26 11:33:11 soap01 kernel: > Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm > ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6 > button battery ac netconsole xt_comment xt_tcpudp xt_state > iptable_filter iptable_mangle iptable_nat ip_nat ip_ conntrack > nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid usb_storage > hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom pci_hotplug ehci_hcd > scsi_transport_spi usbcore bnx2 reiserfs ata_piix ahci libata > dm_snapshot qla2xxx firmware_class qla2xxx_conf intermodule edd d m_mod > fan thermal processor sg megaraid_sas piix sd_mod scsi_mod ide_disk > ide_core Aug 26 11:33:11 soap01 kernel: Pid: 4874, comm: nfsd Tainted: > G U 2.6.16.60-0.21-smp #1 > Aug 26 11:33:11 soap01 kernel: RIP: 0010:[<ffffffff885e21e0>] > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} > Aug 26 11:33:11 soap01 kernel: RSP: 0018:ffff81021e9f1c18 EFLAGS: > 00010292 > Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX: > ffff8101decf30c0 RCX: 0000000000000292 > Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI: > 0000000000000296 RDI: ffffffff80359960 > Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08: > ffffffff80359968 R09: ffff810221d3ad80 > Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11: > 0000000000000070 R12: ffff8101dda21588 > Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14: > ffff8102277ab400 R15: ffff8100a4dd394c > Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000) > GS:ffffffff803d3000(0000) knlGS:0000000000000000 > Aug 26 11:33:11 soap01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > 000000008005003b > Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3: > 000000015e180000 CR4: 00000000000006e0 > Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo > ffff81021e9f0000, task ffff81021f92e860) > Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870 ffff8101dda21588 > 0000000000000000 0000000000000003 > Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f > 0000000000000000 ffff81018ba52040 > Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a > Aug 26 11:33:11 soap01 kernel: Call Trace: > <ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106} > Aug 26 11:33:11 soap01 kernel: > <ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745} > <ffffffff88628766>{:ocfs2:ocfs2_mknod+1710} > Aug 26 11:33:11 soap01 kernel: > <ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127} > <ffffffff80192b48>{vfs_mkdir+346} > Aug 26 11:33:11 soap01 kernel: <ffffffff88522f05>{:nfsd:nfsd_create+753} > <ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217} > Aug 26 11:33:11 soap01 kernel: > <ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216} > <ffffffff884d549a>{:sunrpc:svc_process+982} > Aug 26 11:33:11 soap01 kernel: <ffffffff802ea247>{__down_read+21} > <ffffffff8851e46e>{:nfsd:nfsd+0} > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463} > <ffffffff8010bed2>{child_rip+8} > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0} > <ffffffff8851e46e>{:nfsd:nfsd+0} > Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0} > Aug 26 11:33:11 soap01 kernel: > Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a 5e 88 c2 5a 04 41 > ff 4c 24 08 49 39 5d 28 75 > Aug 26 11:33:11 soap01 kernel: RIP > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP > <ffff81021e9f1c18> > > Operating system: SuSE SLES 10SP1 > Kernel: 2.6.16.60-0.21-smp > OCFS2: 1.4.0-SLES > > Environment: > > * 2 FreeBSD 7.1-RELEASE-p2 NFS Clients > * 2 SLES 10SP1 exporting the filesystem > > The FreeBSD clients are our email servers, so the main traffic is many > small email files. > > NFS mounted with protocol version 3, readdirplus disabled, read and > write buffer of 32k. > > Pre-crash symptoms: > * The ocfs filesystem hung for a while or gets very slow; > * Low or null device traffic on both nodes (checked with `iostat`); > * The server load get 5 to 6 points higher; > * It seems that something in kernel deadlock, as other processes (doing > IO, but in other mount points with raiserfs) hug a CPU with 100% > usage; > Eg: There is a mysql database in raiserfs mount point and the mysqld > hug the CPU when I call `rcmysql stop`; > * Calling `reboot` or `shutdown -r now` blocks the console (didn't tried > to run it with strace to get the locking point, but will try it happen > again); > * imapd on clients blocked in nfs requests; > One of the processes was blocked in (FreeBSD kernel) state bo_wwa. > Looking in some discussion group's over the net, this state means > blocked by stale NFS server. Attaching to the process with `gdb`, its > always blocked in close() libc call; > > imapd process backtrace: > #0 0x282a5da3 in close () from /lib/libc.so.7 > #1 0x282a5711 in memcpy () from /lib/libc.so.7 > #2 0xbfbf9378 in ?? () > #3 0x2828d58d in fclose () from /lib/libc.so.7 > > Could it be related to o2cb configuration? Current configuration: > > O2CB_HEARTBEAT_THRESHOLD=61 > O2CB_IDLE_TIMEOUT_MS=60000 > > The heartbeat network is a GBit ethernet. > > Regards, >