Vipul Pandya
2010-Feb-12 13:53 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hi All,

I am trying to run Lustre over iWARP. For this I have compiled Lustre-1.8.1.1 with linux-2.6.18-128.7.1 source and OFED-1.5 source. I have installed all the required rpms for Lustre.

After this I booted into the Lustre patched kernel and gave the following option in /etc/modprobe.conf for lnet to work with o2ib:

#> cat /etc/modprobe.conf
options lnet networks="o2ib0(eth2)"

I loaded our RDMA adapter modules and the lnet and ko2iblnd modules as follows:

#> modprobe cxgb3
#> modprobe iw_cxgb3
#> modprobe rdma_ucm
#> modprobe lnet
#> modprobe ko2iblnd

I was able to load all the modules successfully.

Then I assigned the IP address to the eth2 interface and brought it up:

#> ifconfig eth2 102.88.88.188/24 up
#> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:30:48:C7:8F:8E
          inet addr:10.193.184.188  Bcast:10.193.187.255  Mask:255.255.252.0
          inet6 addr: fe80::230:48ff:fec7:8f8e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:13224 errors:0 dropped:0 overruns:0 frame:0
          TX packets:797 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1523344 (1.4 MiB)  TX bytes:203205 (198.4 KiB)
          Memory:dea20000-dea40000

eth2      Link encap:Ethernet  HWaddr 00:07:43:05:07:35
          inet addr:102.88.88.188  Bcast:102.88.88.255  Mask:255.255.255.0
          inet6 addr: fe80::207:43ff:fe05:735/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:153 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:22537 (22.0 KiB)  TX bytes:8500 (8.3 KiB)
          Interrupt:185 Memory:de801000-de801fff

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1607 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1607 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3196948 (3.0 MiB)  TX bytes:3196948 (3.0 MiB)

After this I tried to bring the lnet network up as follows:

#> lctl network up
LNET configured

The above command produced the following error in dmesg:

#> dmesg
Lustre: Listener bound to eth2:102.88.88.188:987:cxgb3_0
Lustre: Register global MR array, MR size: 0xffffffff, array size: 2
fmr_pool: Device cxgb3_0 does not support FMRs
LustreError: 4134:0:(o2iblnd.c:1393:kiblnd_create_fmr_pool()) Failed to create FMR pool: -38
Lustre: Added LNI 102.88.88.188@o2ib [8/64/0/0]

I repeated the same procedure on the other Lustre node and got the same result.

Then I tried an lctl ping between the two Lustre nodes, which gave me the following error:

#> lctl ping 102.88.88.184@o2ib
failed to ping 102.88.88.184@o2ib: Input/output error

dmesg showed the following error:

#> dmesg
LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create QP: -12, send_wr: 2056, recv_wr: 18
Lustre: 2453:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 102.88.88.184@o2ib: connection failed

I found one thread that gives a patch to support FMR in o2ib, but I don't think this patch is applicable to lustre-1.8.1.1:

http://lists.lustre.org/pipermail/lustre-discuss/2008-February/006502.html

Can anyone please guide me on this?

Thank you very much in advance.

Vipul
rishi pathak
2010-Feb-15 06:53 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hello Vipul,

On Fri, Feb 12, 2010 at 7:23 PM, Vipul Pandya <vipul at chelsio.com> wrote:
> After this I booted into the lustre patched kernel and gave the following
> option in /etc/modprobe.conf for lnet to work with o2ib
>
> #> cat /etc/modprobe.conf
> options lnet networks="o2ib0(eth2)"

I am not familiar with Lustre over an iWARP interconnect, but is eth2 the device associated with IP over iWARP?

--
Regards--
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing (C-DAC)
Pune University Campus, Ganesh Khind Road
Pune-Maharastra
Vipul Pandya
2010-Feb-15 07:01 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hi Rishi,

First of all, thanks for your response. Yes, eth2 is the device associated with IP over iWARP.

Thanks,
Vipul
Isaac Huang
2010-Feb-16 05:16 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
On Fri, Feb 12, 2010 at 05:53:19AM -0800, Vipul Pandya wrote:
> ......
> #> lctl network up
> LNET configured
> Above command gave me following error in dmesg
> #> dmesg
> Lustre: Listener bound to eth2:102.88.88.188:987:cxgb3_0
> Lustre: Register global MR array, MR size: 0xffffffff, array size: 2
> fmr_pool: Device cxgb3_0 does not support FMRs
> LustreError: 4134:0:(o2iblnd.c:1393:kiblnd_create_fmr_pool()) Failed to create FMR pool: -38

ib_create_fmr_pool() returned -ENOSYS, probably because the HCA didn't support FMR; this was not a fatal error.

> Lustre: Added LNI 102.88.88.188@o2ib [8/64/0/0]
>
> #> lctl ping 102.88.88.184@o2ib
> failed to ping 102.88.88.184@o2ib: Input/output error
> dmesg has shown following error:
> #> dmesg
> LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create QP: -12, send_wr: 2056, recv_wr: 18

rdma_create_qp() returned -ENOMEM; most likely init_qp_attr->cap.max_send_wr was too big (2056) and needed too much memory.

> Lustre: 2453:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 102.88.88.184@o2ib: connection failed

You'd need to use the o2iblnd map-on-demand feature. To find out whether your ko2iblnd module supports it:

modinfo ko2iblnd | grep map_on_demand

If yes, please try:

options ko2iblnd map_on_demand=64

Thanks,
Isaac
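The two return codes named in the reply above are negated Linux errno values. A minimal sketch of decoding them follows; the function name decode_errno is hypothetical, and the table deliberately covers only the two codes that appear in this thread (12 and 38 are the standard Linux numbers for ENOMEM and ENOSYS).

```shell
# Hypothetical helper: map the negative return codes seen in the dmesg
# output of this thread to their Linux errno names.
# Only -12 and -38 are listed; any other value is reported as unknown.
decode_errno() {
    case "$1" in
        -12) echo "ENOMEM (out of memory)" ;;
        -38) echo "ENOSYS (function not implemented)" ;;
        *)   echo "unknown" ;;
    esac
}

decode_errno -38   # code from "Failed to create FMR pool: -38"
decode_errno -12   # code from "Can't create QP: -12"
```

This matches the reading given in the reply: the FMR pool failure means the device lacks the feature, while the QP failure is a resource limit.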
Vipul Pandya
2010-Feb-16 05:45 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hello Isaac,

My ko2iblnd module supports the map_on_demand option, as shown below:

[root@nizam ~]# modinfo ko2iblnd
filename:       /lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1smp/kernel/net/lustre/ko2iblnd.ko
license:        GPL
description:    Kernel OpenIB gen2 LND v2.00
author:         Sun Microsystems, Inc. <http://www.lustre.org/>
srcversion:     069AA2BBD411996C8DF36DD
depends:        libcfs,lnet,ib_core,rdma_cm
vermagic:       2.6.18-128.7.1.el5_lustre.1.8.1.1smp SMP mod_unload gcc-4.1
parm:           service:service number (within RDMA_PS_TCP) (int)
parm:           cksum:set non-zero to enable message (not RDMA) checksums (int)
parm:           timeout:timeout (seconds) (int)
parm:           ntx:# of message descriptors (int)
parm:           credits:# concurrent sends (int)
parm:           peer_credits:# concurrent sends to 1 peer (int)
parm:           peer_credits_hiw:when eagerly to return credits (int)
parm:           peer_buffer_credits:# per-peer router buffer credits (int)
parm:           peer_timeout:Seconds without aliveness news to declare peer dead (<=0 to disable) (int)
parm:           ipif_name:IPoIB interface name (charp)
parm:           retry_count:Retransmissions when no ACK received (int)
parm:           rnr_retry_count:RNR retransmissions (int)
parm:           keepalive:Idle time in seconds before sending a keepalive (int)
parm:           ib_mtu:IB MTU 256/512/1024/2048/4096 (int)
parm:           concurrent_sends:send work-queue sizing (int)
parm:           map_on_demand:map on demand (int)
parm:           fmr_pool_size:size of the fmr pool (>= ntx / 4) (int)
parm:           fmr_flush_trigger:# dirty FMRs that triggers pool flush (int)
parm:           fmr_cache:non-zero to enable FMR caching (int)
parm:           pmr_pool_size:size of the MR cache pmr pool (int)

I tried to load the ko2iblnd module as you suggested, but I am still unable to do 'lctl ping'. I am getting the same error, as shown below.

#> modprobe ko2iblnd map_on_demand=64
#> modprobe lnet
#> lctl ping 102.88.88.184@o2ib
failed to ping 102.88.88.184@o2ib: Input/output error
#> dmesg
Lustre: Listener bound to eth2:102.88.88.188:987:cxgb3_0
Lustre: Register global MR array, MR size: 0xffffffff, array size: 2
fmr_pool: Device cxgb3_0 does not support FMRs
LustreError: 4122:0:(o2iblnd.c:1393:kiblnd_create_fmr_pool()) Failed to create FMR pool: -38
Lustre: Added LNI 102.88.88.188@o2ib [8/64/0/0]
LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create QP: -12, send_wr: 520, recv_wr: 18
Lustre: 2453:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 102.88.88.184@o2ib: connection failed

I would be grateful if you could provide some more thoughts on this. Please let me know if you require any further debugging information.

Thanks,
Vipul
Isaac Huang
2010-Feb-16 15:59 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
On Mon, Feb 15, 2010 at 09:45:10PM -0800, Vipul Pandya wrote:
> ......
> -> I tried to load the ko2iblnd module as you have suggested. But still
> I am unable to do 'lctl ping'. I am getting the same error as shown
> below.
> #> modprobe ko2iblnd map_on_demand=64

Please lower it to "map_on_demand=32".

> #> modprobe lnet
> #> lctl ping 102.88.88.184@o2ib
> failed to ping 102.88.88.184@o2ib: Input/output error
> #> dmesg
> Lustre: Listener bound to eth2:102.88.88.188:987:cxgb3_0
> Lustre: Register global MR array, MR size: 0xffffffff, array size: 2
> fmr_pool: Device cxgb3_0 does not support FMRs
> LustreError: 4122:0:(o2iblnd.c:1393:kiblnd_create_fmr_pool()) Failed to create FMR pool: -38
> Lustre: Added LNI 102.88.88.188@o2ib [8/64/0/0]
> LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create QP: -12, send_wr: 520, recv_wr: 18
> Lustre: 2453:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 102.88.88.184@o2ib: connection failed

rdma_create_qp() failed with -ENOMEM again, even with a much smaller send_wr (520 vs 2056). If lowering map_on_demand still doesn't fix it, you'd need to look into the HCA driver/firmware to find out why it failed to create the QP (assuming there's enough memory for it).

Isaac
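When trying successive map_on_demand values, it helps to watch how the requested send_wr changes between attempts (2056, then 520 in the logs above). The sketch below pulls that number out of the QP failure lines; the function name qp_send_wr is hypothetical, and the pattern assumes the message format shown in this thread.

```shell
# Hypothetical helper: read dmesg-style text on stdin and print the send_wr
# value from every "create QP" failure line. The pattern is based only on
# the log lines quoted in this thread.
qp_send_wr() {
    sed -n 's/.*create QP: -[0-9]*, send_wr: \([0-9]*\),.*/\1/p'
}

# Sample input taken from the two attempts reported in the thread.
printf '%s\n' \
    "LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create QP: -12, send_wr: 2056, recv_wr: 18" \
    "LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create QP: -12, send_wr: 520, recv_wr: 18" \
    | qp_send_wr
```

On a live node the same function could be fed from dmesg, for example `dmesg | qp_send_wr`.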
Vipul Pandya
2010-Feb-22 11:22 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hello Isaac,

Thank you very much for your response.

I lowered the map_on_demand value to 16 and now it works fine.

However, I have one concern: will lowering the map_on_demand value affect Lustre performance?

Thanks again.
Vipul
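A sketch of how the working setting could be written down for /etc/modprobe.conf, combining the lnet line from the first message with the ko2iblnd option suggested earlier in the thread. The function name lnet_o2ib_options is hypothetical; it only prints the lines so they can be reviewed before being added to the file. Module options are read when the module is loaded, so ko2iblnd has to be reloaded before a new value takes effect.

```shell
# Hypothetical helper: print modprobe option lines for LNet over o2ib.
#   $1: network interface (eth2 in this thread)
#   $2: map_on_demand value (16 is the value reported to work here)
lnet_o2ib_options() {
    printf 'options lnet networks="o2ib0(%s)"\n' "$1"
    printf 'options ko2iblnd map_on_demand=%s\n' "$2"
}

lnet_o2ib_options eth2 16
```

The value 16 is specific to the adapter in this thread; other hardware may accept or need a different value.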
Isaac Huang
2010-Feb-25 20:56 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
On Mon, Feb 22, 2010 at 03:22:52AM -0800, Vipul Pandya wrote:
> Hello Issac,

Hi Vipul,

> ......
> I lowered the map_on_demand value to 16 and now it works fine.
>
> However, I had once concern, whether lowering down this map_on_demand
> value would impact the performance of Lustre or not?

For iWARP, you probably have no alternative. I remember that there's a restriction somewhere in the iWARP stack that limits the size of SQs (which was why the rdma_create_qp errors happened), and lowering map_on_demand is the only way to reduce the Lustre SQ length.

For InfiniBand, lowering map_on_demand essentially reduces the # of RDMA WQEs needed for each Lustre bulk data movement, at the cost of memory registration/deregistration at most per bulk transfer; without map_on_demand the o2iblnd uses a static MR, so there's no memory registration cost. There could be a point in the # of frags of the bulk buffer where the cost of handling RDMA WQEs (which usually equals the # of frags) exceeds the cost of MR, and that's what you should set map_on_demand to. However, since both costs are mostly determined by the HCA hardware/firmware implementation, there's no one good setting for all interconnects, and you can only find it by testing.

The LNet selftest is a useful tool for running such tests:

http://manual.lustre.org/manual/LustreManual16_HTML/LustreIOKit.html#50610302_36273

Hope this helps,
Isaac
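A rough sketch of what such an LNet selftest run might look like for the two nodes in this thread. The function below only prints an lst command sequence so it can be reviewed first; it does not run anything. The function name lst_bulk_script, the session, group and batch names, the 1M transfer size and the 30-second sampling window are all assumptions for illustration, and the lst syntax should be checked against the selftest chapter of the manual for the installed Lustre version. The lnet_selftest module is assumed to be loaded on both nodes.

```shell
# Hypothetical helper: print (not run) an LNet selftest command sequence
# for one bulk write test between two nodes.
#   $1: NID of the sending node, $2: NID of the receiving node
lst_bulk_script() {
    cat <<EOF
export LST_SESSION=\$\$
lst new_session map_on_demand_test
lst add_group senders $1
lst add_group receivers $2
lst add_batch bulk
lst add_test --batch bulk --from senders --to receivers brw write size=1M
lst run bulk
lst stat receivers & sleep 30; kill \$!
lst end_session
EOF
}

# NIDs of the two nodes from this thread.
lst_bulk_script 102.88.88.188@o2ib 102.88.88.184@o2ib
```

Repeating the same run after reloading ko2iblnd with a different map_on_demand value would give the comparison described above.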
Vipul Pandya
2010-Feb-26 08:56 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hi Isaac,

This was very helpful. Thanks a lot for your response.

Vipul