Hello! We have a Lustre cluster with two OSS servers running the "2.6.16.60-0.31_lustre.1.6.7-smp" kernel. The system has been installed for about a month. The servers have 200 clients and everything was working well, but yesterday one of the OSS servers crashed. This is the log message:

Jul 21 15:40:56 lxsrv3 kernel: Assertion failure in journal_start() at fs/jbd/transaction.c:282: "handle->h_transaction->t_journal == journal"
Jul 21 15:40:56 lxsrv3 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Jul 21 15:40:56 lxsrv3 kernel: Kernel BUG at fs/jbd/transaction.c:282
Jul 21 15:40:56 lxsrv3 kernel: invalid opcode: 0000 [1] SMP
Jul 21 15:40:56 lxsrv3 kernel: last sysfs file: /devices/system/cpu/cpu0/cpufreq/scaling_max_freq
Jul 21 15:40:56 lxsrv3 kernel: CPU 5
Jul 21 15:40:56 lxsrv3 kernel: Modules linked in: af_packet quota_v2 nfs xt_pkttype ipt_LOG xt_limit obdfilter fsfilt_ldiskfs ost mgc ldiskfs crc16 lustre lov mdc lquota osc ksocklnd ptlrpc obdclass lnet lvfs libcfs nfsd exportfs lockd nfs_acl sunrpc cpufreq_conservative cpufreq_ondemand cpufreq_userspace cpufreq_powersave speedstep_centrino freq_table button battery ac ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat iptable_filter ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter ip6_tables x_tables ipv6 loop dm_mod uhci_hcd ehci_hcd shpchp ide_cd i2c_i801 cdrom e1000 usbcore pci_hotplug i2c_core hw_random megaraid_sas ext3 jbd sg edd fan mptsas mptscsih mptbase scsi_transport_sas ahci libata piix thermal processor sd_mod scsi_mod ide_disk ide_core
Jul 21 15:40:56 lxsrv3 kernel: Pid: 4978, comm: ll_ost_io_91 Tainted: G U 2.6.16.60-0.31_lustre.1.6.7-smp #1
Jul 21 15:40:56 lxsrv3 kernel: RIP: 0010:[<ffffffff881203a5>] <ffffffff881203a5>{:jbd:journal_start+98}
Jul 21 15:40:56 lxsrv3 kernel: RSP: 0000:ffff8104393cd348 EFLAGS: 00010292
Jul 21 15:40:56 lxsrv3 kernel: RAX: 0000000000000073 RBX: ffff810364a5d4f8 RCX: 0000000000000292
Jul 21 15:40:56 lxsrv3 kernel: RDX: ffffffff8034e968 RSI: 0000000000000296 RDI: ffffffff8034e960
Jul 21 15:40:56 lxsrv3 kernel: RBP: ffff81044426b400 R08: ffffffff8034e968 R09: ffff81044c47b580
Jul 21 15:40:56 lxsrv3 kernel: R10: ffff810001071680 R11: ffffffff803c8000 R12: 0000000000000012
Jul 21 15:40:56 lxsrv3 kernel: R13: ffff8104393cd3d8 R14: 0000000000000080 R15: 0000000000000180
Jul 21 15:40:56 lxsrv3 kernel: FS: 00002b239e35a6f0(0000) GS:ffff81044f1a66c0(0000) knlGS:0000000000000000
Jul 21 15:40:56 lxsrv3 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 21 15:40:56 lxsrv3 kernel: CR2: 00002af6312fa000 CR3: 00000004469fd000 CR4: 00000000000006e0
Jul 21 15:40:56 lxsrv3 kernel: Process ll_ost_io_91 (pid: 4978, threadinfo ffff8104393cc000, task ffff8104392a50c0)
Jul 21 15:40:56 lxsrv3 kernel: Stack: ffff8103ea582260 ffff8103ea582498 ffff8103ea582260 ffffffff8873f4bc
Jul 21 15:40:56 lxsrv3 kernel:        ffff8103ea582260 ffff8103ea582498 0000000000000000 ffffffff80199ab3
Jul 21 15:40:56 lxsrv3 kernel:        ffff8104393cd248 ffff8103ea582270
Jul 21 15:40:56 lxsrv3 kernel: Call Trace: <ffffffff8873f4bc>{:ldiskfs:ldiskfs_dquot_drop+60}
Jul 21 15:40:56 lxsrv3 kernel:        <ffffffff80199ab3>{clear_inode+182} <ffffffff80199e03>{dispose_list+86}
Jul 21 15:40:56 lxsrv3 kernel:        <ffffffff8019a045>{shrink_icache_memory+418} <ffffffff80167db3>{shrink_slab+226}
Jul 21 15:40:56 lxsrv3 kernel:        <ffffffff80168b8d>{try_to_free_pages+408} <ffffffff8016398b>{__alloc_pages+449}
Jul 21 15:40:56 lxsrv3 kernel:        <ffffffff88124ba4>{:jbd:find_revoke_record+98} <ffffffff8015f3bb>{find_or_create_page+53}
Jul 21 15:40:56 lxsrv3 kernel:        <ffffffff88731d61>{:ldiskfs:ldiskfs_truncate+241} <ffffffff80184040>{__getblk+29}
Jul 21 15:40:56 lxsrv3 kernel:        <ffffffff8016b778>{unmap_mapping_range+89} <ffffffff8872fbb7>{:ldiskfs:ldiskfs_mark_iloc_dirty+1047}
Jul 21 15:40:56 lxsrv3 kernel:        <ffffffff8016d227>{vmtruncate+162} <ffffffff8019aabb>{inode_setattr+34}
Jul 21 15:40:56 lxsrv3 kernel:        <ffffffff8873343b>{:ldiskfs:ldiskfs_setattr+459} <ffffffff8879f4cf>{:fsfilt_ldiskfs:fsfilt_ldiskfs_setattr+287}

What's the problem?
On Jul 24, 2009, at 3:35 AM, Patricia Santos Marco wrote:
> Hello! We have a Lustre cluster with two OSS servers running the "2.6.16.60-0.31_lustre.1.6.7-smp" kernel. The system has been installed for about a month. The servers have 200 clients and everything was working well, but yesterday one of the OSS servers crashed. This is the log message:
>
> Jul 21 15:40:56 lxsrv3 kernel: Assertion failure in journal_start() at fs/jbd/transaction.c:282: "handle->h_transaction->t_journal == journal"
> Jul 21 15:40:56 lxsrv3 kernel: Kernel BUG at fs/jbd/transaction.c:282
> [snip: rest of the oops log quoted above]
>
> What's the problem?

Patricia -

This looks similar to a problem described in Lustre bug 20008.

You may want to look at https://bugzilla.lustre.org/show_bug.cgi?id=20008 for more information, and add a comment there describing the crash you experienced so it is known that other sites are hitting the problem.

-walter
2009/7/24 Walter Poxon <walter at routingdynamics.com>:
> Patricia -
>
> This looks similar to a problem described in Lustre bug 20008.
>
> You may want to look at https://bugzilla.lustre.org/show_bug.cgi?id=20008
> for more information, and add a comment there describing the crash you
> experienced so it is known that other sites are hitting the problem.
>
> -walter

Thanks! Is there a solution to this problem yet? I have been searching for a patch in Lustre 1.8.1 but couldn't find the correct one.