>>>>> "Kumaran" == Kumaran Rajaram <krajaram@lnxi.com> writes:
Kumaran,
Kumaran> Roland, Lustre's openibnal has issues with mem-based
Kumaran> Mellanox card (mod_thh driver). Workaround is to use
Kumaran> memfree Mellanox card that uses mod_rhh driver. Please
Kumaran> refer to #7246 in Lustre Bugzilla for more info:
Kumaran> https://bugzilla.clusterfs.com/show_bug.cgi?id=7246
thanks for pointing me to this. Our situation confirms what is
mentioned in the Bugzilla thread: our OSTs/MDSs are running mem-free
PCI-e cards and are stable. The clients are running mem-full PCI-X
cards and crash. Strange that this bug is so old and still
unresolved. I thought Lustre over InfiniBand was too hot a topic to
let this slip for so long. Is there a solution in sight? I guess it
might be worthwhile to contact our Mellanox reps about this as well ...
Thanks,
Roland
>>> Roland Fehrenbacher <rf@q-leap.de> 04/03/06 2:39 pm >>>
Kumaran> Hi,
Kumaran> we were starting to stress-test the openib interface of
Kumaran> Lustre against a MDS/OST setup that works flawlessly via
Kumaran> the tcp interface. We use IBGD 1.8.2, and kernel 2.6.15.6
Kumaran> with our patches from http://mrvn.homeip.net/lustre/. The
Kumaran> infiniband stack itself works fine, as shown by extensive
Kumaran> testing using mvapich. We succeed in mounting the lov on
Kumaran> the client via the openib layer, but running an iozone
Kumaran> benchmark () results in a crash after about 10GB of the
Kumaran> initial write. Here is the oops message:
Kumaran>
Kumaran> -------------------------------------------------------------------------------
Kumaran> beo-03 login: [  513.949692] LustreError: 2008:0:(openiblnd_cb.c:264:kibnal_rx_callback()) ASSERTION(rx->rx_nob < 0) failed
Kumaran> [  513.961199] LustreError: 2008:0:(tracefile.c:254:libcfs_assertion_failed()) LBUG
Kumaran> [  513.969888] Lustre: 2008:0:(linux-debug.c:132:libcfs_debug_dumpstack()) showing stack for process 2008
Kumaran> [  513.987315] kibnal_sd_00  R  running task       0  2008      1          2009  1811 (L-TLB)
Kumaran> [  513.997463] 0000000000000002 00070000c0a83403 00070000c0a833fd ffffffff8851d155
Kumaran> [  514.006129]        0000000000000000 00070000c0a833fd 0000000000000246 00000001ffc4f400
Kumaran> [  514.015861]        0000000100000000 ffff8101ffd33270
Kumaran> [  514.021994] Call Trace:<ffffffff8851d155>{:lnet:lnet_parse+9181} <ffffffff88513e26>{:lnet:lnet_complete_msg_locked+1079}
Kumaran> [  514.034960]        <ffffffff88514061>{:lnet:lnet_finalize+473} <ffffffff88654d70>{:kopeniblnd:kibnal_rx+487}
Kumaran> [  514.046741]        <ffffffff88659e2f>{:kopeniblnd:kibnal_scheduler+239}
Kumaran> [  514.054596]        <ffffffff801421ea>{autoremove_wake_function+0} <ffffffff80131b1c>{do_exit+2998}
Kumaran> [  514.065382]        <ffffffff801421ea>{autoremove_wake_function+0} <ffffffff8010e746>{child_rip+8}
Kumaran> [  514.076037]        <ffffffff88659d40>{:kopeniblnd:kibnal_scheduler+0} <ffffffff8010e73e>{child_rip+0}
Kumaran> -------------------------------------------------------------------------------
Kumaran> Can anybody with more knowledge shed some light on the
Kumaran> reason for the
Kumaran> LASSERT (rx->rx_nob < 0);
Kumaran> statement in openiblnd_cb.c, and give a clue how to solve
Kumaran> the problem?
Kumaran> Many thanks,
Kumaran> Roland
Kumaran> Lustre-discuss mailing list Lustre-discuss@clusterfs.com
Kumaran> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss