Tom:
I just reviewed my logs and found similar reports on all 5 of my quad
core machines during a stretch when I had two months of uptime. So it did not
cause my production machines to lock up completely and require a power cycle. I
recently rebooted them all to do some work on the racking; otherwise I'm
confident they would still have been running.
[root at ns1 ~]# uptime
07:33:25 up 9 days, 10:44, 6 users, load average: 0.02, 0.06, 0.08
I'll be watching the logs more closely for these messages in the future and
will gauge whether there is any unresponsiveness from the network or from the
machines that would require a reboot.
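Something like the following (assuming the stock RHEL syslog setup, where
kernel messages land in /var/log/messages) should be enough to spot a
recurrence:

# scan current and rotated logs for the lockup signature
[root at ns1 ~]# grep "BUG: soft lockup" /var/log/messages*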
Arden
--- On Tue, 6/16/09, Tom Woezel <twoezel at it.dcs.ch> wrote:
> From: Tom Woezel <twoezel at it.dcs.ch>
> Subject: [Lustre-discuss] Kernel bug in combination with bonding
> To: lustre-discuss at lists.lustre.org
> Date: Tuesday, June 16, 2009, 3:57 AM
> Dear all,
> Currently we are running a Lustre environment
> with 2 servers for MGS and MDTs and 3 OSDs, all Sun x4140
> with RedHat EL5 and Lustre 1.6.7. Recently we decided to go
> for bonding on the 3 OSDs. We bonded all 4 interfaces
> together, and so far the configuration is working. Today I
> recognized that one of the OSDs is showing weird behavior
> and some of the clients are having problems connecting to the
> filesystem. From what I have learned so far, this is a known
> kernel bug in this kernel version
> (http://bugs.centos.org/view.php?id=3095) and I
> couldn't find a solution for it.
> I was wondering if any of you have encountered a
> similar problem and, if so, how did you fix it?
> Current kernel is:
> [root at sososd1 ~]# uname -a
> Linux sososd1 2.6.18-92.1.17.el5_lustre.1.6.7smp #1 SMP Mon Feb 9 19:56:55 MST 2009 x86_64 x86_64 x86_64 GNU/Linux
> The bonding configuration:
> [root at sososd1 ~]# cat /etc/modprobe.conf
> alias eth0 forcedeth
> alias eth1 forcedeth
> alias eth2 forcedeth
> alias eth3 forcedeth
> alias bond0 bonding
> options bond0 miimon=100 mode=4
> alias scsi_hostadapter aacraid
> alias scsi_hostadapter1 sata_nv
> alias scsi_hostadapter2 qla2xxx
> alias scsi_hostadapter3 usb-storage
> options lnet networks="tcp(bond0)"
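> Since the bond is in mode=4 (802.3ad/LACP), the negotiated
> aggregator state can be checked through the bonding proc
> interface, for example:
> # shows bonding mode, MII status and per-slave LACP details
> [root at sososd1 ~]# cat /proc/net/bonding/bond0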
> [root at sososd1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
> DEVICE=bond0
> IPADDR=xxx.xxx.xxx.xxx
> NETMASK=xxx.xxx.xxx.xxx
> NETWORK=xxx.xxx.xxx.xxx
> BROADCAST=xxx.xxx.xxx.xxx
> GATEWAY=xxx.xxx.xxx.xxx
> ONBOOT=yes
> BOOTPROTO=none
> USERCTL=no
> And each of the interfaces is configured like this:
> [root at sososd1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
> # nVidia Corporation MCP55 Ethernet
> DEVICE=eth0
> ONBOOT=yes
> BOOTPROTO=none
> USERCTL=no
> MASTER=bond0
> SLAVE=yes
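> As a quick sanity check, the enslaved interfaces can also be
> listed via the bonding sysfs interface, e.g.:
> # list the slaves currently attached to bond0
> [root at sososd1 ~]# cat /sys/class/net/bond0/bonding/slaves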
> And this is an extract from the log file:
> Jun 16 04:33:38 sososd1 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [bond0:3914]
> Jun 16 04:33:38 sososd1 kernel: CPU 2:
> Jun 16 04:33:38 sososd1 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) bonding(U) dm_rdac(U) dm_round_robin(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_nforce2(U) sr_mod(U) cdrom(U) pata_acpi(U) i2c_core(U) forcedeth(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) sata_nv(U) libata(U) shpchp(U) aacraid(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> Jun 16 04:33:38 sososd1 kernel: Pid: 3914, comm: bond0 Tainted: G      2.6.18-92.1.17.el5_lustre.1.6.7smp #1
> Jun 16 04:33:38 sososd1 kernel: RIP: 0010:[<ffffffff80064b4c>]  [<ffffffff80064b4c>] .text.lock.spinlock+0x2/0x30
> Jun 16 04:33:38 sososd1 kernel: RSP: 0018:ffff81012b993d98  EFLAGS: 00000286
> Jun 16 04:33:38 sososd1 kernel: RAX: 0000000000000001 RBX: ffff81012b97a080 RCX: 0000000000000004
> Jun 16 04:33:38 sososd1 kernel: RDX: ffff81012b97a000 RSI: ffff81012b97a080 RDI: ffff81012b97a168
> Jun 16 04:33:38 sososd1 kernel: RBP: ffff81012b993d10 R08: 0000000000000000 R09: ffff810226ad5d28
> Jun 16 04:33:38 sososd1 kernel: R10: 000000fe0000009a R11: ffff810227efcae0 R12: ffffffff8005dc8e
> Jun 16 04:33:38 sososd1 kernel: R13: ffff81010e39d81e R14: ffffffff80076fd7 R15: ffff81012b993d10
> Jun 16 04:33:38 sososd1 kernel: FS:  00002abdd36dc220(0000) GS:ffff810104159240(0000) knlGS:00000000f7f928d0
> Jun 16 04:33:38 sososd1 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> Jun 16 04:33:38 sososd1 kernel: CR2: 00002aaaac009000 CR3: 0000000000201000 CR4: 00000000000006e0
> Jun 16 04:33:38 sososd1 kernel:
> Jun 16 04:33:38 sososd1 kernel: Call Trace:
> Jun 16 04:33:38 sososd1 kernel: <IRQ> [<ffffffff883f0477>] :bonding:ad_rx_machine+0x20/0x502
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff883f0aa2>] :bonding:bond_3ad_lacpdu_recv+0xc1/0x1fc
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff80046717>] try_to_wake_up+0x407/0x418
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff80020139>] netif_receive_skb+0x330/0x3ae
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8020c75b>] pci_mmcfg_read+0x4a/0xbb
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff800302f5>] process_backlog+0x84/0xe1
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff883f0e76>] :bonding:bond_3ad_state_machine_handler+0x0/0x84a
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8000c52c>] net_rx_action+0xa4/0x1a4
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff80011ec2>] __do_softirq+0x5e/0xd6
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff80154d15>] end_msi_irq_w_maskbit+0xf/0x1c
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8006c67e>] do_softirq+0x2c/0x85
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8006c506>] do_IRQ+0xec/0xf5
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8005d615>] ret_from_intr+0x0/0xa
> Jun 16 04:33:38 sososd1 kernel: <EOI> [<ffffffff800649d8>] _spin_lock+0x3/0xa
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff883f0477>] :bonding:ad_rx_machine+0x20/0x502
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff883f0f4a>] :bonding:bond_3ad_state_machine_handler+0xd4/0x84a
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8004cd5b>] run_workqueue+0x94/0xe4
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff80049666>] worker_thread+0x0/0x122
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff80049756>] worker_thread+0xf0/0x122
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff80032409>] kthread+0xfe/0x132
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8003230b>] kthread+0x0/0x132
> Jun 16 04:33:38 sososd1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
>
> A restart of the network didn't work, and the
> machine did not respond on the console afterwards. After a
> reboot of the machine the error was gone, but from what I
> have found on the web it will appear again.
> Thanks in advance for any help.
> Kind regards
> -----------------------------------------------------------------
> Tom Woezel | DCS Contractor in DMO/OTS/SOS Group
> Office 2001 ESO/IPP | System Administrator
> Tel.: +49-89-32006-184 |
> Fax.: +49-89-32006-677 | Address:
>                        | European Southern Observatory
> mailto:twoezel at it.dcs.ch | Karl-Schwarzschild-Strasse 2
>                        | D-85748 Garching bei Munchen, Germany
> web: http://www.dcs.ch | http://www.eso.org
> -----------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>