I have a small cluster here to test the viability of Lustre for our purposes. I have 56 client nodes, an active/standby MDS pair, and 14 OSSes. One of the users started up a job on the client nodes, and the cluster promptly went nuts (I ended up having to reboot a bunch of nodes). All 72 machines are connected via InfiniBand. In the logs, odfs001 is the MDS and o5056 is a client node.

Do I have a misconfiguration here? Or did I break something at the transport layer?

daniel

--

[dmayfield@zdmayfield lustre_dmesg]$ grep Lustre *|grep Dropping |cut -f 1 -d :|uniq -c
      2 o5056.dmesg
     18 odfs001.dmesg
    590 odfs002.dmesg
     78 odfs003.dmesg
     94 odfs004.dmesg
     92 odfs005.dmesg
     98 odfs006.dmesg
     97 odfs007.dmesg
     98 odfs008.dmesg
    110 odfs010.dmesg
     97 odfs011.dmesg
    145 odfs012.dmesg
    107 odfs013.dmesg
    103 odfs014.dmesg
    113 odfs015.dmesg
    632 odfs016.dmesg

[dmayfield@zdmayfield lustre_dmesg]$ grep Lustre * |grep Dropping |head
o5056.dmesg:Lustre: 25671:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-192.168.50.238@tcp: peer not alive
o5056.dmesg:Lustre: 25671:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-192.168.50.238@tcp: peer not alive
odfs001.dmesg:Lustre: 18202:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-0@lo portal 26 match 1359131017745740 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6822:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.55@o2ib portal 26 match 1359226700706635 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6846:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-10.4.2.55@tcp portal 12 match 1359226700706636 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6818:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.55@o2ib portal 26 match 1359226700706643 offset 0 length 368: 2
odfs001.dmesg:Lustre: 6846:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-10.4.2.55@tcp portal 12 match 1359226700706652 offset 0 length 368: 2
odfs001.dmesg:Lustre: 6820:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.56@o2ib portal 26 match 1359136284362079 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6843:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-10.4.2.56@tcp portal 12 match 1359136284362080 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6810:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.56@o2ib portal 26 match 1359136284362088 offset 0 length 368: 2

---------------------------------------------------------------
This email, along with any attachments, is confidential. If you believe you received this message in error, please contact the sender immediately and delete all copies of the message. Thank you.
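The per-file counts above show how many messages each node dropped, but not which LNet network (tcp or o2ib) or which code path was involved. A minimal Python sketch of a finer-grained tally follows. It assumes the `file:Lustre: pid:n:(source:line:function()) Dropping ...` layout seen in the grep output; the names `tally_drops`, `DROP_RE` and `NID_RE` are made up for illustration, and the NID pattern accepts both `addr@net` and the archive's `addr at net` spelling.

```python
import re
from collections import Counter

# Assumed layout, taken from the grep output in this thread:
#   <host>.dmesg:Lustre: <pid>:<n>:(<file>:<line>:<function>()) Dropping <rest>
DROP_RE = re.compile(
    r"^(?P<host>[^:]+)\.dmesg:Lustre: \d+:\d+:"
    r"\([^:]+:\d+:(?P<func>\w+)\(\)\) Dropping (?P<rest>.*)$"
)

# A peer id such as 12345-192.168.50.55@o2ib (or "... at o2ib" in the archive).
NID_RE = re.compile(r"\d+-(?P<addr>[\d.]+)(?:@| at )(?P<net>\w+)")


def tally_drops(lines):
    """Count 'Dropping' lines per (host, reporting function, LNet network)."""
    counts = Counter()
    for line in lines:
        m = DROP_RE.match(line.rstrip("\n"))
        if not m:
            continue  # not a Lustre "Dropping" line in the assumed layout
        nid = NID_RE.search(m.group("rest"))
        net = nid.group("net") if nid else "unknown"
        counts[(m.group("host"), m.group("func"), net)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines modelled on the output above.
    sample = [
        "o5056.dmesg:Lustre: 25671:0:(lib-move.c:1028:lnet_post_send_locked()) "
        "Dropping message for 12345-192.168.50.238@tcp: peer not alive",
        "odfs001.dmesg:Lustre: 6822:0:(lib-move.c:1826:lnet_parse_put()) "
        "Dropping PUT from 12345-192.168.50.55@o2ib portal 26 match "
        "1359226700706635 offset 0 length 192: 2",
    ]
    for key, count in sorted(tally_drops(sample).items()):
        print(count, *key)
```

Fed the output of `grep Lustre * | grep Dropping`, this would split each node's total by network, which may help show whether the drops are concentrated on the tcp or the o2ib side.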
On Thu, Mar 10, 2011 at 10:02:03AM -0600, Daniel Mayfield wrote:
> Pid: 12935, comm: ptlrpcd-brw Not tainted 2.6.32.28-2.rgm #1 PowerEdge R610
                                ^^^^^^^^^^^ ^^^^^^^^^^^^^^^
Please note that 2.0 does not support 2.6.32 kernels. 2.1 will.

> RIP: 0010:[<ffffffff811ebf93>] [<ffffffff811ebf93>] sg_next+0x3/0x30
> Call Trace:
> [<ffffffffa094dc3e>] ? kiblnd_map_tx+0x1be/0x430 [ko2iblnd]
> [<ffffffffa094e8a1>] ? kiblnd_queue_tx_locked+0x91/0x2b0 [ko2iblnd]

This is a bug introduced by one of the patches from bug 21951 (attachment 29114). This patch was then reverted from both b1_8 (see 23123) and master (see 23332), but it is unfortunately included in 2.0.0.

Cheers,
Johann
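The kernel release Johann points at can be pulled out of the Oops "Pid:" line mechanically, which may be handy when checking many client dmesg files. The Python sketch below is illustrative only: `kernel_release` and `is_supported` are made-up names, the list of supported series must be supplied by the caller from the release notes for the Lustre version in use, and the `2.6.18` series and the `2.6.18-194.el5` release string in the example are assumptions based solely on the statement in this thread that the servers run the 2.6.18 kernel distributed with 2.0.

```python
import re

# Finds the first token shaped like x.y.z... after "comm:" in an Oops "Pid:" line,
# e.g. "Pid: 12935, comm: ptlrpcd-brw Not tainted 2.6.32.28-2.rgm #1 PowerEdge R610".
PID_LINE_RE = re.compile(
    r"Pid: \d+, comm: .*?\s(?P<release>\d+\.\d+\.\d+\S*)(?:\s|$)"
)


def kernel_release(oops_line):
    """Return the kernel release from an Oops 'Pid:' line, or None if absent."""
    m = PID_LINE_RE.search(oops_line)
    return m.group("release") if m else None


def is_supported(release, supported_series):
    """True if release belongs to one of the caller-supplied series.

    supported_series is an assumption provided by the caller (for example
    ["2.6.18"]); this function knows nothing about Lustre itself.
    """
    return any(
        release == s or release.startswith(s + ".") or release.startswith(s + "-")
        for s in supported_series
    )


if __name__ == "__main__":
    line = ("Pid: 12935, comm: ptlrpcd-brw Not tainted "
            "2.6.32.28-2.rgm #1 PowerEdge R610")
    release = kernel_release(line)
    # The series list and the second release string are hypothetical inputs.
    print(release, is_supported(release, ["2.6.18"]))
    print("2.6.18-194.el5", is_supported("2.6.18-194.el5", ["2.6.18"]))
```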
So if I move my clients to 1.8.5, I should be ok? The servers are running the 2.6.18 kernel distributed with 2.0.

Daniel

On Mar 10, 2011, at 11:19, Johann Lombardi <johann at whamcloud.com> wrote:

> On Thu, Mar 10, 2011 at 10:02:03AM -0600, Daniel Mayfield wrote:
>> Pid: 12935, comm: ptlrpcd-brw Not tainted 2.6.32.28-2.rgm #1 PowerEdge R610
>                                ^^^^^^^^^^^ ^^^^^^^^^^^^^^^
> Please note that 2.0 does not support 2.6.32 kernels. 2.1 will.
>
>> RIP: 0010:[<ffffffff811ebf93>] [<ffffffff811ebf93>] sg_next+0x3/0x30
>> Call Trace:
>> [<ffffffffa094dc3e>] ? kiblnd_map_tx+0x1be/0x430 [ko2iblnd]
>> [<ffffffffa094e8a1>] ? kiblnd_queue_tx_locked+0x91/0x2b0 [ko2iblnd]
>
> This is a bug introduced by one of the patch from bug 21951 (attachment 29114).
> This patch was then reverted from both b1_8 (see 23123) and master (see 23332),
> but it is unfortunately included in 2.0.0.
>
> Cheers,
> Johann
On Thu, Mar 10, 2011 at 01:12:06PM -0600, Daniel Mayfield wrote:
> So if I move my clients to 1.8.5, I should be ok? The servers are running the 2.6.18 kernel distributed with 2.0.

Yes, this should work. That said, although interoperability between 1.8 clients and 2.0 servers is regularly tested, I am not aware of anyone using it in production. In any case, this would just be a temporary solution until 2.1 is available.

Cheers,
Johann

--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com