Michael Sternberg
2010-Jul-19 23:00 UTC
[Lustre-discuss] panic on jbd:journal_dirty_metadata
Hello,

I use OSSs with external journal partitions, and since lustre-1.8.1 I get frustrating panics on the OSSs about once or twice a week, as follows:

  :libcfs:cfs_alloc ...
  :lvfs:lprocfs_counter_add ...
  ...
  RIP [<ffffffff88031e64>] :jbd:journal_dirty_metadata+0x7f/0x1e3
  RSP <ffff8101f99c3670>
  <0>Kernel panic - not syncing: Fatal exception

(full graphical screenshot attached, hoping it passes through)

Clients sometimes report:

  Message from syslogd@ at Mon Jul 19 04:11:46 2010 ...
  login4 kernel: journal commit I/O error

I recently updated to 1.8.3, at which point I e2fsck'd and re-initialized the external journals, but I still get those panics. I use 2 OSSs with heartbeat failover; each one "owns" and normally serves 2 OSTs (4 OSTs total), coming from 4 LUNs on a RAID unit with dual controllers. All OSTs use ldiskfs (pre-ext4 proper, if I understand correctly), with the journals located on partitions of 2 further LUNs. I use a variant of the script at bug 20807 to account for the different device numbers of the external journals on the two OSSs. Failover usually works, eventually, after a load peak of up to 100 on the OSS taking over and messages about hung threads (see below).

Is there anything I could do besides giving up on external journals? My data stores are RAID1, and the journal disks are a single pair of disks, also in RAID1.

I had difficulty locating further information when googling "journal_dirty_metadata" as it pertains to lustre/ldiskfs specifically. There are old discussions at:

  https://bugzilla.redhat.com/show_bug.cgi?id=183119                     2007/2008 - kernel 2.4.7
  http://oss.oracle.com/pipermail/ocfs2-users/2010-January/004113.html   (ahem)

With best regards,
Michael


[root@mds01 ~]# cat /proc/fs/lustre/version
lustre: 1.8.3
kernel: patchless_client
build:  1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3

[root@mds01 ~]# uname -r
2.6.18-164.11.1.el5_lustre.1.8.3

[root@mds01 ~]# lctl dl
  0 UP mgs MGS MGS 725
  1 UP mgc MGC172.17.120.1@o2ib 9642cdcd-4955-ca05-4e85-8a9f6d10c027 5
  2 UP mdt MDS MDS_uuid 3
  7 UP lov sandbox-mdtlov sandbox-mdtlov_UUID 4
  8 UP mds sandbox-MDT0000 sandbox-MDT0000_UUID 719
  9 UP osc sandbox-OST0000-osc sandbox-mdtlov_UUID 5
 10 UP osc sandbox-OST0001-osc sandbox-mdtlov_UUID 5

[root@mds01 ~]# ssh mds02 lctl dl
  0 UP mgc MGC172.17.120.1@o2ib 12c07f8c-f1e7-f739-9983-3c3aa2ec492a 5
  1 UP mdt MDS MDS_uuid 3
  2 UP lov carbonfs-mdtlov carbonfs-mdtlov_UUID 4
  3 UP mds carbonfs-MDT0000 carbonfs-MDT0000_UUID 719
  4 UP osc carbonfs-OST0001-osc carbonfs-mdtlov_UUID 5
  5 UP osc carbonfs-OST0000-osc carbonfs-mdtlov_UUID 5

[root@oss01 ~]# lctl dl
  0 UP mgc MGC172.17.120.1@o2ib a89ba7f9-2a12-ff77-0321-151a1addf043 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter sandbox-OST0000 sandbox-OST0000_UUID 721
  3 UP obdfilter carbonfs-OST0000 carbonfs-OST0000_UUID 721
  4 UP obdfilter sandbox-OST0001 sandbox-OST0001_UUID 721
  5 UP obdfilter carbonfs-OST0001 carbonfs-OST0001_UUID 721

[client]# lctl dl
  0 UP mgc MGC172.17.120.1@o2ib dadd88bf-fbad-d933-b02a-a539fd8abfea 5
  1 UP lov sandbox-clilov-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 4
  2 UP mdc sandbox-MDT0000-mdc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
  3 UP osc sandbox-OST0000-osc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
  4 UP osc sandbox-OST0001-osc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
  5 UP lov carbonfs-clilov-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 4
  6 UP mdc carbonfs-MDT0000-mdc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
  7 UP osc carbonfs-OST0001-osc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
  8 UP osc carbonfs-OST0000-osc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5

--------------------------------------------------
"thread hung" messages:
--------------------------------------------------
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: Recovery period over after 1:05, of 359 clients 358 recovered and 0 were evicted.
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: sending delayed replies to recovered clients
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: received MDS connection from 172.17.120.2@o2ib
Jul 19 04:03:36 oss01 kernel: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 172.17.0.235@o2ib ns: filter-carbonfs-OST0001_UUID lock: ffff810178b8d600/0x80a4d28c4aff67ec lrc: 3/0,0 mode: PW/PW res: 152472/0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0x4701b34288b619f7 expref: 12 pid: 6806 timeout 4356934523
Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) carbonfs-OST0001: exp ffff8101ecf94e00 already connecting
Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) Skipped 38 previous similar messages
Jul 19 04:04:25 oss01 kernel: Lustre: Service thread pid 6145 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jul 19 04:04:25 oss01 kernel: Lustre: Skipped 2 previous similar messages
Jul 19 04:04:25 oss01 kernel: Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 6145
Jul 19 04:04:25 oss01 kernel: Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) Skipped 2 previous similar messages
Jul 19 04:04:25 oss01 kernel: ll_ost_io_38  D 0000000000000000     0  6145      1          6146  6144 (L-TLB)
Jul 19 04:04:25 oss01 kernel:  ffff810214dcb730 0000000000000046 ffff81021158b220 ffff81017fe44cc0
Jul 19 04:04:25 oss01 kernel:  ffff81017fe44cc0 000000000000000a ffff810214d4a7e0 ffff810214c0f7e0
Jul 19 04:04:25 oss01 kernel:  000038708ddb5ba3 0000000000001a95 ffff810214d4a9c8 000000042fe56000
Jul 19 04:04:25 oss01 kernel: Call Trace:
Jul 19 04:04:25 oss01 kernel:  [<ffffffff88036769>] :jbd:log_wait_commit+0xa3/0xf5
Jul 19 04:04:25 oss01 kernel:  [<ffffffff800a00be>] autoremove_wake_function+0x0/0x2e
Jul 19 04:04:25 oss01 kernel:  [<ffffffff88cc900b>] :fsfilt_ldiskfs:fsfilt_ldiskfs_commit_wait+0xab/0xd0
Jul 19 04:04:25 oss01 kernel:  [<ffffffff88d0798f>] :obdfilter:filter_commitrw_write+0x19ef/0x2980
Jul 19 04:04:25 oss01 kernel:  [<ffffffff88ca4ed5>] :ost:ost_checksum_bulk+0x385/0x5b0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca4c38>] :ost:ost_checksum_bulk+0xe8/0x5b0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88cab8da>] :ost:ost_brw_write+0x1c1a/0x23b0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889ec238>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889b7b10>] :ptlrpc:target_committed_to_req+0x40/0x120
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca781d>] :ost:ost_brw_read+0x189d/0x1a70
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889f06c5>] :ptlrpc:lustre_msg_get_version+0x35/0xf0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8008c86b>] default_wake_function+0x0/0xe
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889f0788>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88caec1e>] :ost:ost_handle+0x2bae/0x53d0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889cde89>] :ptlrpc:__ldlm_add_waiting_lock+0x189/0x1c0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889cf10e>] :ptlrpc:ldlm_refresh_waiting_lock+0xbe/0x110
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca2571>] :ost:ost_prolong_locks_iter+0x151/0x180
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889fd9b5>] :ptlrpc:ptlrpc_server_handle_request+0xaa5/0x1140
Jul 19 04:04:26 oss01 kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8003dafe>] lock_timer_base+0x1b/0x3c
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8001ca74>] __mod_timer+0xb0/0xbe
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88a01408>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8008c86b>] default_wake_function+0x0/0xe
Jul 19 04:04:26 oss01 kernel:  [<ffffffff800b7076>] audit_syscall_exit+0x336/0x362
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88a001b0>] :ptlrpc:ptlrpc_main+0x0/0x1420
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jul 19 04:04:26 oss01 kernel:
--------------------------------------------------

[Attachment: HPC_2010-06-30_oss02_panic-out.png (panic screenshot), 29628 bytes:
 http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100719/bbfce767/attachment-0001.png]
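[Editorial note, for readers hitting the same external-journal setup: the kind of remapping a bug 20807-style script has to perform can be sketched roughly as below. This is an illustration only, not the actual script from that bug; the device paths and mount point are placeholders, and it assumes mount.lustre passes the ldiskfs "journal_dev" mount option through to the backing filesystem, as it does for other ldiskfs options.]

  # Sketch only -- not the script from bug 20807; device paths and the
  # mount point are placeholders.
  OST=/dev/mapper/ost0            # ldiskfs OST volume (placeholder)
  JNL=/dev/mapper/ost0-journal    # its external journal partition (placeholder)

  # The OST superblock records its external journal by UUID and by device
  # number; the UUID stays valid on either OSS, the device number may not.
  dumpe2fs -h "$OST" 2>/dev/null | grep -i journal

  # Compute this node's device number for the journal (major*256 + minor
  # covers ordinary device numbers) and hand it to ldiskfs at mount time,
  # so a stale number recorded while mounted on the other OSS does not matter.
  MAJOR=$(printf '%d' 0x$(stat -c '%t' "$JNL"))
  MINOR=$(printf '%d' 0x$(stat -c '%T' "$JNL"))
  mount -t lustre -o journal_dev=$(( MAJOR * 256 + MINOR )) "$OST" /mnt/ost0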
Wojciech Turek
2010-Jul-22
[Lustre-discuss] panic on jbd:journal_dirty_metadata

Hi Michael,

This looks like the problem we had some time ago after upgrading to 1.8.3:

  https://bugzilla.lustre.org/show_bug.cgi?id=22889

Best regards
Wojciech

On 20 July 2010 00:00, Michael Sternberg <sternberg@anl.gov> wrote:
> I use OSSs with external journal partitions, and since lustre-1.8.1 I get
> frustrating panics on the OSSs about once or twice a week, as follows:
>
>   RIP [<ffffffff88031e64>] :jbd:journal_dirty_metadata+0x7f/0x1e3
>   RSP <ffff8101f99c3670>
>   <0>Kernel panic - not syncing: Fatal exception
> [...]
--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27@cam.ac.uk
Tel: (+)44 1223 763517
Michael Sternberg
2010-Jul-24 16:06 UTC
[Lustre-discuss] panic on jbd:journal_dirty_metadata
Wojciech,

Thank you very much for your pointer. Perhaps the fact that the OSTs are nearly full contributes(?). I also see higher usage.

In any case, I'll attempt compilation with the patch applied.

With best regards,
Michael


On Jul 22, 2010, at 9:16, Wojciech Turek wrote:
> Hi Michael,
>
> This looks like the problem we had some time ago after upgrading to 1.8.3:
>
>   https://bugzilla.lustre.org/show_bug.cgi?id=22889
> [...]
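[Editorial note: the rebuild itself is short; a rough sketch follows. The patch file name, its -p level, and the kernel source path are placeholders, not details taken from the thread.]

  # Sketch: apply the bug 22889 patch to a lustre-1.8.3 source tree and
  # rebuild the packages.  bz22889.patch and the kernel source path are
  # placeholders; adjust the -p level to match how the patch was generated.
  cd lustre-1.8.3
  patch -p1 < ~/bz22889.patch
  ./configure --with-linux=/usr/src/kernels/2.6.18-164.11.1.el5_lustre.1.8.3-x86_64
  make rpms    # the resulting lustre-modules rpm carries the patched obdfilter.ko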
Wojciech Turek
2010-Jul-25
[Lustre-discuss] panic on jbd:journal_dirty_metadata

Hi Michael,

Our OSTs were also nearly full when the problem occurred. After installing the patch we didn't have a single occurrence of that problem.

Cheers

Wojciech

On 24 July 2010 17:06, Michael Sternberg <sternberg@anl.gov> wrote:
> Thank you very much for your pointer. Perhaps the fact that the OSTs are
> nearly full contributes(?). I also see higher usage.
>
> In any case, I'll attempt compilation with the patch applied.
> [...]

--
Wojciech Turek
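[Editorial note: since both sites saw the panic with nearly-full OSTs, a quick way to keep an eye on how full they are is sketched below; the client mount point is a placeholder.]

  # On a client: per-OST space and inode usage (mount point is a placeholder)
  lfs df -h /mnt/carbonfs
  lfs df -i /mnt/carbonfs

  # On the OSS itself: the raw counters behind those numbers
  lctl get_param obdfilter.*.kbytesfree obdfilter.*.kbytestotal
  lctl get_param obdfilter.*.filesfree  obdfilter.*.filestotal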
Michael Sternberg
2010-Aug-03 23:27 UTC
[Lustre-discuss] Solved: panic on jbd:journal_dirty_metadata
Hello Wojciech,

Confirmed - I built and installed the patch as well, and the problem hasn't occurred again here either - thank you!

For reference, I'm using the released kernel and e2fsprogs rpm plus three rebuilt rpms. The patch only affects obdfilter.ko in lustre-modules. "nm /lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/obdfilter.ko" produced identical output before and after the patch, which I found reassuring.

# rpm -qa | grep -e e2fs -e lustre | sort
e2fsprogs-1.41.10.sun2-0redhat
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3_<date>
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3_<date>
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3_<date>

With best regards,
Michael


On Jul 25, 2010, at 18:08, Wojciech Turek wrote:
> Hi Michael,
>
> Our OSTs were also nearly full when the problem occurred. After installing
> the patch we didn't have a single occurrence of that problem.
> [...]
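[Editorial note, for anyone repeating this: a couple of quick checks that the module actually installed is the rebuilt one. nm lists only the symbol table, which a code-only patch leaves unchanged, so a checksum comparison is the stronger before/after test; keeping a copy of the unpatched module as obdfilter.ko.orig is an assumption here, not something from the thread.]

  # Which package owns the module, and do the installed files still match it?
  rpm -qf /lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/obdfilter.ko
  rpm -V lustre-modules

  # nm shows only symbols; checksums catch the changed code.
  # obdfilter.ko.orig is assumed to be a copy saved before the rebuild.
  md5sum obdfilter.ko.orig \
         /lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/obdfilter.ko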