Hi again,

Running lustre in mixed endian env produced random kernel crashes on MDS, OSS and client side. The scenario is mounting lustre on the client and writing a file onto the lustre fs.

The dump is below.

Appreciate your help

Itamar.

[4295274.606000] Badness in smp_call_function at arch/mips/kernel/smp.c:160
[4295274.606000] Call Trace:
[4295274.606000] [<ffffffff81112978>] smp_call_function+0xf8/0x214
[4295274.606000] [<ffffffff81480d68>] _spin_unlock_bh+0x0/0x14
[4295274.606000] [<ffffffff81480b98>] _spin_lock_bh+0x0/0x30
[4295274.606000] [<ffffffff811228bc>] handle_exception+0x18c/0xe18
[4295274.606000] [<c00000000006fc2c>] libcfs_debug_msg+0x0/0x38 [libcfs]
[4295274.606000] [<ffffffff81480d68>] _spin_unlock_bh+0x0/0x14
[4295274.606000] [<ffffffff81480b98>] _spin_lock_bh+0x0/0x30
[4295274.606000] [<ffffffff81121eb0>] trap_low+0x210/0x38c
[4295274.606000] [<c00000000006fc2c>] libcfs_debug_msg+0x0/0x38 [libcfs]
[4295274.606000] [<ffffffff81480d68>] _spin_unlock_bh+0x0/0x14
[4295274.606000] [<ffffffff81480b98>] _spin_lock_bh+0x0/0x30
[4295274.606000] [<c00000000029d738>] lustre_msg_get_flags+0x94/0x190 [ptlrpc]
[4295274.606000] [<c00000000029d6a4>] lustre_msg_get_flags+0x0/0x190 [ptlrpc]
[4295274.606000] [<c00000000029d824>] lustre_msg_get_flags+0x180/0x190 [ptlrpc]
[4295274.606000] [<ffffffff8110e170>] do_gettimeofday+0x50/0x1d0
[4295274.606000] [<ffffffff8114473c>] getnstimeofday+0x18/0x4c
[4295274.606000] [<ffffffff8145bc2c>] packet_rcv_spkt+0x328/0x34c
[4295274.606000] [<ffffffff8145bc2c>] packet_rcv_spkt+0x328/0x34c
[4295274.606000] [<ffffffff8140a1a0>] qdisc_restart+0x198/0x30c
[4295274.606000] [<ffffffff813f95cc>] dev_queue_xmit+0x120/0x2f4
[4295274.606000] [<c00000000029c073>] lustre_swab_buf+0x19f/0x1f8 [ptlrpc]
[4295274.606000] [<ffffffff8141beb4>] ip_queue_xmit+0x514/0x5dc
[4295274.606000] [<ffffffff8110b128>] do_IRQ+0x24/0x34
[4295274.606000] [<ffffffff811037dc>] ll_cputimer_irq+0xc/0x14
[4295274.606000] [<ffffffff8110ef14>] timer_interrupt+0x48c/0x4ac
[4295274.606000] [<ffffffff81430b58>] tcp_transmit_skb+0x7a8/0x7f0
[4295274.606000] [<ffffffff81167c3c>] handle_IRQ_event+0x64/0xc4
[4295274.606000] [<ffffffff81149f20>] __mod_timer+0x38/0x11c
[4295274.606000] [<ffffffff813ef988>] lock_sock+0xd8/0xf0
[4295274.606000] [<ffffffff81431d4c>] __tcp_push_pending_frames+0x300/0x3f0
[4295274.606000] [<ffffffff813f1c4c>] __alloc_skb+0x98/0x164
[4295274.606000] [<ffffffff813f52c8>] skb_copy_datagram_iovec+0x68/0x2ec
[4295274.606000] [<ffffffff81425ea8>] tcp_recvmsg+0x824/0x968
[4295274.606000] [<ffffffff81425ea8>] tcp_recvmsg+0x824/0x968
[4295274.606000] [<ffffffff8115c488>] ktime_get+0x10/0x38
[4295274.606000] [<ffffffff813f0acc>] sock_common_recvmsg+0x38/0x54
[4295274.606000] [<ffffffff813ebdc4>] sock_recvmsg+0xb8/0xe8
[4295274.606000] [<c00000000041b0c0>] lnet_me_unlink+0x1fc/0x3ec [lnet]
[4295274.606000] [<ffffffff811597c8>] autoremove_wake_function+0x0/0x44
[4295274.606000] [<c00000000006fc2c>] libcfs_debug_msg+0x0/0x38 [libcfs]
[4295274.606000] [<c00000000029d738>] lustre_msg_get_flags+0x94/0x190 [ptlrpc]
[4295274.606000] [<c0000000002a7448>] debug_req+0x200/0x4f8 [ptlrpc]
[4295274.606000] [<c000000000070b6c>] libcfs_debug_vmsg+0x6b8/0xd54 [libcfs]
[4295274.606000] [<c000000000070598>] libcfs_debug_vmsg+0xe4/0xd54 [libcfs]
[4295274.606000] [<c00000000006fc2c>] libcfs_debug_msg+0x0/0x38 [libcfs]
[4295274.606000] [<ffffffff81480d68>] _spin_unlock_bh+0x0/0x14
[4295274.606000] [<ffffffff81480b98>] _spin_lock_bh+0x0/0x30
[4295274.606000] [<c0000000002a8380>] reply_in_callback+0x650/0x904 [ptlrpc]
[4295274.606000] [<ffffffff81480b98>] _spin_lock_bh+0x0/0x30
[4295274.606000] [<c00000000041b5e8>] lnet_enq_event_locked+0x148/0x1dc [lnet]
[4295274.606000] [<c00000000041c140>] lnet_finalize+0x268/0x468 [lnet]
[4295274.606000] [<ffffffff81480b98>] _spin_lock_bh+0x0/0x30
[4295274.606000] [<c00000000033fc44>] ksocknal_process_receive+0x8b4/0x96c [ksocklnd]
[4295274.606000] [<c000000000340698>] ksocknal_scheduler+0x488/0xc90 [ksocklnd]
[4295274.606000] [<c000000000340d8c>] ksocknal_scheduler+0xb7c/0xc90 [ksocklnd]
[4295274.606000] [<ffffffff811597c8>] autoremove_wake_function+0x0/0x44
[4295274.606000] [<ffffffff811597c8>] autoremove_wake_function+0x0/0x44
[4295274.606000] [<ffffffff8110b838>] kernel_thread_helper+0x10/0x18
[4295274.606000]
From: Itamar Ofek <itamaro@fabrix.tv>
Date: Mon, 27 Nov 2006 14:38:21 +0200

> Hi again,
>
> Running lustre in mixed endian env produced random kernel crashes on
> MDS, OSS and client side. The scenario is mounting lustre on the client
> and writing a file onto the lustre fs.

When I tried running a big-endian client against a little-endian server I couldn't get it to connect at all. Some debugging revealed that byte-swapping seemed to be happening at the wrong time in various places, and the connection never made it through the initial negotiation.

Did you do anything special to get as far as you did? Are you just running a stock 1.6b5 distribution? What kind of machines are the client and server?
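To illustrate what "byte-swapping at the wrong time" can mean in practice, here is a minimal sketch in C of a receiver that decides whether to swab a header by looking at a magic number. It is not taken from the Lustre source: `demo_hdr`, `DEMO_MSG_MAGIC`, `demo_swab32` and `demo_hdr_fixup` are hypothetical names invented for this example, and the magic value is arbitrary. The point is only that the header has to be swabbed exactly once, before any field is read, and that a second call must be harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary magic chosen for this sketch; NOT a Lustre constant. */
#define DEMO_MSG_MAGIC 0x4D534731u

/* Hypothetical wire header: every field is sent in the sender's byte order. */
struct demo_hdr {
    uint32_t magic;
    uint32_t flags;
    uint32_t body_len;
};

/* Reverse the byte order of a 32-bit value. */
static inline uint32_t demo_swab32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/*
 * Convert a received header to native byte order, in place.
 * Returns  0 if the header was already in native order (nothing changed),
 *          1 if it came from an opposite-endian peer and was swabbed,
 *         -1 if the magic matches neither order (header left untouched).
 * After a successful call the magic is native, so calling this again
 * returns 0 and cannot swab the fields a second time.
 */
static inline int demo_hdr_fixup(struct demo_hdr *hdr)
{
    assert(hdr != NULL);

    if (hdr->magic == DEMO_MSG_MAGIC)
        return 0;
    if (hdr->magic != demo_swab32(DEMO_MSG_MAGIC))
        return -1;

    hdr->magic    = demo_swab32(hdr->magic);
    hdr->flags    = demo_swab32(hdr->flags);
    hdr->body_len = demo_swab32(hdr->body_len);
    return 1;
}
```

In this sketch, any code that reads `flags` or `body_len` before `demo_hdr_fixup()` has run sees garbage values on an opposite-endian link, and any code that swabs unconditionally after it has run undoes the conversion. Either mistake would be enough to break an initial negotiation.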
Jean-Marc Saffroy
2006-Nov-27 07:19 UTC
[Lustre-discuss] Re: lustre 1.6 beta 5 mixed endian
On Mon, 27 Nov 2006, Itamar Ofek wrote:

> Hi again,
>
> Running lustre in mixed endian env produced random kernel crashes on
> MDS, OSS and client side. The scenario is mounting lustre on the client
> and writing a file onto the lustre fs.
>
> The dump is below.

Is it the same dump on all 3 nodes?

It could be useful to have some of the last kernel messages before the stack trace. Also could you show the line arch/mips/kernel/smp.c:160 ? What kernel are you using?

[snip long stack trace]

This stack is quite deep, any chance it could be overflowing on mips?

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
Nathaniel Rutman
2006-Nov-27 08:13 UTC
[Lustre-discuss] Re: lustre 1.6 beta 5 mixed endian
There are known mixed-endian problems with the new messaging code in Lustre 1.6 betas. We are working on fixing them.

https://bugzilla.lustre.org/show_bug.cgi?id=11214

Jean-Marc Saffroy wrote:

> On Mon, 27 Nov 2006, Itamar Ofek wrote:
>
>> Hi again,
>>
>> Running lustre in mixed endian env produced random kernel crashes on
>> MDS, OSS and client side. The scenario is mounting lustre on the client
>> and writing a file onto the lustre fs.
>>
>> The dump is below.
>
> Is it the same dump on all 3 nodes?
>
> It could be useful to have some of the last kernel messages before the
> stack trace. Also could you show the line arch/mips/kernel/smp.c:160 ?
> What kernel are you using?
>
> [snip long stack trace]
>
> This stack is quite deep, any chance it could be overflowing on mips?