Vipul Pandya
2010-Feb-12  13:53 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hi All,
 
I am trying to run Lustre over iWARP. For this I have compiled
Lustre-1.8.1.1 with linux-2.6.18-128.7.1 source and OFED-1.5 source.
I have installed all the required rpms for lustre.
 
After this I booted into the Lustre-patched kernel and set the
following option in /etc/modprobe.conf so that lnet works with o2ib:
#> cat /etc/modprobe.conf
options lnet networks="o2ib0(eth2)"
 
I loaded our RDMA adapter modules and the lnet and ko2iblnd modules as
follows:
#> modprobe cxgb3
#> modprobe iw_cxgb3
#> modprobe rdma_ucm
#> modprobe lnet
#> modprobe ko2iblnd
 
I was able to load all the modules successfully.
 
Then I assigned the ip address to eth2 interface and brought it up
#> ifconfig eth2 102.88.88.188/24 up
#> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:30:48:C7:8F:8E
          inet addr:10.193.184.188  Bcast:10.193.187.255  Mask:255.255.252.0
          inet6 addr: fe80::230:48ff:fec7:8f8e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:13224 errors:0 dropped:0 overruns:0 frame:0
          TX packets:797 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1523344 (1.4 MiB)  TX bytes:203205 (198.4 KiB)
          Memory:dea20000-dea40000
 
eth2      Link encap:Ethernet  HWaddr 00:07:43:05:07:35
          inet addr:102.88.88.188  Bcast:102.88.88.255  Mask:255.255.255.0
          inet6 addr: fe80::207:43ff:fe05:735/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:153 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:22537 (22.0 KiB)  TX bytes:8500 (8.3 KiB)
          Interrupt:185 Memory:de801000-de801fff
 
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1607 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1607 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3196948 (3.0 MiB)  TX bytes:3196948 (3.0 MiB)
 
After this I tried to bring the lnet network up as follows:
#> lctl network up
LNET configured
 
The above command produced the following messages in dmesg:
#> dmesg
Lustre: Listener bound to eth2:102.88.88.188:987:cxgb3_0
Lustre: Register global MR array, MR size: 0xffffffff, array size: 2
fmr_pool: Device cxgb3_0 does not support FMRs
LustreError: 4134:0:(o2iblnd.c:1393:kiblnd_create_fmr_pool()) Failed to create FMR pool: -38
Lustre: Added LNI 102.88.88.188@o2ib [8/64/0/0]
 
I repeated the same procedure on the other Lustre node and got the
same result.
Then I tried an lctl ping between the two Lustre nodes, which gave me
the following error:
 
#> lctl ping 102.88.88.184@o2ib
failed to ping 102.88.88.184@o2ib: Input/output error
 
dmesg showed the following error:
#> dmesg
LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create QP: -12, send_wr: 2056, recv_wr: 18
Lustre: 2453:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 102.88.88.184@o2ib: connection failed
 
I found one thread that provides a patch to support FMR in o2ib,
but I don't think this patch is applicable to lustre-1.8.1.1.
http://lists.lustre.org/pipermail/lustre-discuss/2008-February/006502.html
 
Can anyone please guide me on this?
 
Thank you very much in advance.
Vipul
 
rishi pathak
2010-Feb-15  06:53 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hello Vipul,

On Fri, Feb 12, 2010 at 7:23 PM, Vipul Pandya <vipul at chelsio.com> wrote:
> After this I booted into the lustre patched kernel and gave the following
> option in /etc/modprobe.conf for lnet to work with o2ib
> #> cat /etc/modprobe.conf
> options lnet networks="o2ib0(eth2)"

I am not familiar with Lustre over an iWARP interconnect, but is eth2
the device associated with IP over iWARP?

-- 
Regards,
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing (C-DAC)
Pune University Campus, Ganesh Khind Road
Pune-Maharastra
Vipul Pandya
2010-Feb-15  07:01 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hi Rishi,
 
First of all, thanks for your response.
Yes, eth2 is the device associated with IP over iWARP.
 
Thanks,
Vipul
 
Isaac Huang
2010-Feb-16  05:16 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
On Fri, Feb 12, 2010 at 05:53:19AM -0800, Vipul Pandya wrote:
> ......
> #> lctl network up
> LNET configured
> Above command gave me following error in dmesg
> #> dmesg
>
> Lustre: Listener bound to eth2:102.88.88.188:987:cxgb3_0
> Lustre: Register global MR array, MR size: 0xffffffff, array size: 2
> fmr_pool: Device cxgb3_0 does not support FMRs
> LustreError: 4134:0:(o2iblnd.c:1393:kiblnd_create_fmr_pool()) Failed to
> create FMR pool: -38

ib_create_fmr_pool() returned -ENOSYS, probably because the HCA doesn't
support FMR; this was not a fatal error.

> Lustre: Added LNI 102.88.88.188@o2ib [8/64/0/0]
>
> #> lctl ping 102.88.88.184@o2ib
> failed to ping 102.88.88.184@o2ib: Input/output error
> dmesg has shown following error:
> #> dmesg
> LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create
> QP: -12, send_wr: 2056, recv_wr: 18

rdma_create_qp() returned -ENOMEM; most likely
init_qp_attr->cap.max_send_wr was too big (2056) and needed too much
memory.

> Lustre: 2453:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed())
> Deleting messages for 102.88.88.184@o2ib: connection failed

You'd need to use the o2iblnd map-on-demand feature. To find out
whether your ko2iblnd module supports it:

    modinfo ko2iblnd | grep map_on_demand

If yes, please try:

    options ko2iblnd map_on_demand=64

Thanks,
Isaac
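
The return codes in these log lines are negated Linux errno values, which is how -38 is read as ENOSYS and -12 as ENOMEM in the reply above. A minimal sketch of that lookup, assuming a hypothetical helper named errno_name that covers only the two codes seen in this thread:

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): map a negated kernel return
# code from a log line to its symbolic errno name. Only the two codes
# that appear in this thread are covered.
errno_name() {
    code=${1#-}                 # strip a leading minus sign, if any
    case "$code" in
        12) echo ENOMEM ;;      # out of memory
        38) echo ENOSYS ;;      # function not implemented
        *)  echo "UNKNOWN($code)" ;;
    esac
}

errno_name -38   # FMR pool creation failure
errno_name -12   # QP creation failure
```

Anything outside those two codes is reported as UNKNOWN rather than guessed.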
Vipul Pandya
2010-Feb-16  05:45 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hello Isaac,

My ko2iblnd module supports the map_on_demand option, as shown below:

[root@nizam ~]# modinfo ko2iblnd
filename:       /lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1smp/kernel/net/lustre/ko2iblnd.ko
license:        GPL
description:    Kernel OpenIB gen2 LND v2.00
author:         Sun Microsystems, Inc. <http://www.lustre.org/>
srcversion:     069AA2BBD411996C8DF36DD
depends:        libcfs,lnet,ib_core,rdma_cm
vermagic:       2.6.18-128.7.1.el5_lustre.1.8.1.1smp SMP mod_unload gcc-4.1
parm:           service:service number (within RDMA_PS_TCP) (int)
parm:           cksum:set non-zero to enable message (not RDMA) checksums (int)
parm:           timeout:timeout (seconds) (int)
parm:           ntx:# of message descriptors (int)
parm:           credits:# concurrent sends (int)
parm:           peer_credits:# concurrent sends to 1 peer (int)
parm:           peer_credits_hiw:when eagerly to return credits (int)
parm:           peer_buffer_credits:# per-peer router buffer credits (int)
parm:           peer_timeout:Seconds without aliveness news to declare peer dead (<=0 to disable) (int)
parm:           ipif_name:IPoIB interface name (charp)
parm:           retry_count:Retransmissions when no ACK received (int)
parm:           rnr_retry_count:RNR retransmissions (int)
parm:           keepalive:Idle time in seconds before sending a keepalive (int)
parm:           ib_mtu:IB MTU 256/512/1024/2048/4096 (int)
parm:           concurrent_sends:send work-queue sizing (int)
parm:           map_on_demand:map on demand (int)
parm:           fmr_pool_size:size of the fmr pool (>= ntx / 4) (int)
parm:           fmr_flush_trigger:# dirty FMRs that triggers pool flush (int)
parm:           fmr_cache:non-zero to enable FMR caching (int)
parm:           pmr_pool_size:size of the MR cache pmr pool (int)

I tried loading the ko2iblnd module as you suggested, but I am still
unable to do 'lctl ping'. I get the same error, as shown below.

#> modprobe ko2iblnd map_on_demand=64
#> modprobe lnet
#> lctl ping 102.88.88.184@o2ib
failed to ping 102.88.88.184@o2ib: Input/output error
#> dmesg
Lustre: Listener bound to eth2:102.88.88.188:987:cxgb3_0
Lustre: Register global MR array, MR size: 0xffffffff, array size: 2
fmr_pool: Device cxgb3_0 does not support FMRs
LustreError: 4122:0:(o2iblnd.c:1393:kiblnd_create_fmr_pool()) Failed to create FMR pool: -38
Lustre: Added LNI 102.88.88.188@o2ib [8/64/0/0]
LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create QP: -12, send_wr: 520, recv_wr: 18
Lustre: 2453:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 102.88.88.184@o2ib: connection faile

I would be grateful if you could provide some more thoughts on this.
Please let me know if you require any further debugging information.

Thanks,
Vipul
Isaac Huang
2010-Feb-16  15:59 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
On Mon, Feb 15, 2010 at 09:45:10PM -0800, Vipul Pandya wrote:
> ......
> -> I tried to load the ko2iblnd module as you have suggested. But still
> I am unable to do 'lctl ping'. I am getting the same error as shown
> below.
> #> modprobe ko2iblnd map_on_demand=64

Please lower it to "map_on_demand=32".

> #> modprobe lnet
> #> lctl ping 102.88.88.184@o2ib
> failed to ping 102.88.88.184@o2ib: Input/output error
> #> dmesg
> Lustre: Listener bound to eth2:102.88.88.188:987:cxgb3_0
> Lustre: Register global MR array, MR size: 0xffffffff, array size: 2
> fmr_pool: Device cxgb3_0 does not support FMRs
> LustreError: 4122:0:(o2iblnd.c:1393:kiblnd_create_fmr_pool()) Failed to
> create FMR pool: -38
> Lustre: Added LNI 102.88.88.188@o2ib [8/64/0/0]
> LustreError: 2453:0:(o2iblnd.c:801:kiblnd_create_conn()) Can't create
> QP: -12, send_wr: 520, recv_wr: 18
> Lustre: 2453:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting
> messages for 102.88.88.184@o2ib: connection faile

rdma_create_qp() failed with -ENOMEM again, even with a much smaller
send_wr (520 vs 2056). If lowering map_on_demand still doesn't fix it,
you'd need to look into the HCA driver/firmware to find out why it
failed to create the QP (assuming there's enough memory for it).

Isaac
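
The two send_wr values logged in this thread (2056 with the default setting, 520 with map_on_demand=64) both fit send_wr = (fragments + 1) x 8 if the default is taken to be 256 fragments. That relation is inferred here from the two logged numbers only and is not stated by anyone in the thread, so the sketch below is an assumption; the function name, the default of 256 and the factor 8 are illustrative.

```shell
#!/bin/sh
# Assumed relation, fitted to the two values logged in this thread:
#   send_wr = (frags + 1) * sends
# frags: the map_on_demand value (256 assumed when the option is unset)
# sends: assumed to be 8
send_wr() {
    frags=$1
    sends=${2:-8}
    echo $(( (frags + 1) * sends ))
}

# Estimate the send queue size for a few candidate settings.
for frags in 256 64 32 16; do
    echo "frags=$frags send_wr=$(send_wr "$frags")"
done
```

If the relation holds, each halving of map_on_demand roughly halves the send queue the adapter is asked to create, which is why stepping down through 64, 32 and 16 is a reasonable way to search for a value the adapter accepts.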
Vipul Pandya
2010-Feb-22  11:22 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hello Isaac,

Thank you very much for your response.
I lowered the map_on_demand value to 16 and now it works fine.

However, I have one concern: would lowering the map_on_demand value
affect the performance of Lustre?

Thanks again.
Vipul
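
Passing map_on_demand on the modprobe command line only lasts until the module is next reloaded. A minimal sketch of keeping the working value in a modprobe configuration file instead, assuming a hypothetical helper named set_map_on_demand and assuming map_on_demand is the only ko2iblnd option on its line:

```shell
#!/bin/sh
# Hypothetical helper: make FILE contain exactly one
# "options ko2iblnd map_on_demand=VALUE" line, keeping all other lines.
# Assumes map_on_demand is the only ko2iblnd option set on that line.
set_map_on_demand() {
    conf=$1
    value=$2
    tmp=$(mktemp) || return 1
    # Drop any earlier map_on_demand line; grep exits non-zero when it
    # prints nothing, which is harmless here.
    grep -v '^options ko2iblnd map_on_demand=' "$conf" > "$tmp" 2>/dev/null
    echo "options ko2iblnd map_on_demand=$value" >> "$tmp"
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Demonstration on a scratch file rather than the real /etc/modprobe.conf.
demo=$(mktemp)
printf '%s\n' 'options lnet networks="o2ib0(eth2)"' \
              'options ko2iblnd map_on_demand=64' > "$demo"
set_map_on_demand "$demo" 16
cat "$demo"
rm -f "$demo"
```

Module options are read when the module is loaded, so ko2iblnd has to be unloaded and loaded again before a changed value takes effect.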
Isaac Huang
2010-Feb-25  20:56 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
On Mon, Feb 22, 2010 at 03:22:52AM -0800, Vipul Pandya wrote:
> Hello Issac,

Hi Vipul,

> ......
> I lowered the map_on_demand value to 16 and now it works fine.
>
> However, I had once concern, whether lowering down this map_on_demand
> value would impact the performance of Lustre or not?

For iWARP, you probably have no alternative. I remember that there's a
restriction somewhere in the iWARP stack that limits the size of SQs
(which is why the rdma_create_qp errors happened), and lowering
map_on_demand is the only way to reduce the Lustre SQ length.

For InfiniBand, lowering map_on_demand essentially reduces the # of
RDMA WQEs needed for each Lustre bulk data movement, at the cost of a
memory registration/deregistration, at most once per bulk transfer;
without map_on_demand the o2iblnd uses a static MR, so there's no
memory registration cost. There could be a point in the # of frags of
the bulk buffer where the cost of handling RDMA WQEs (which usually
equals the # of frags) exceeds the cost of MR, and that's what you
should set map_on_demand to. However, since both costs are mostly
determined by the HCA hardware/firmware implementation, there's no one
good setting for all interconnects, and you can only find it by
testing. The LNet selftest is a useful tool for running such tests:
http://manual.lustre.org/manual/LustreManual16_HTML/LustreIOKit.html#50610302_36273

Hope this helps,
Isaac
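
A minimal sketch of the kind of LNet selftest run suggested above, assuming the lst utility and the lnet_selftest module are available on the nodes. The helper name selftest_plan, the session, group and batch names, and the 30-second sampling window are all made up for illustration; the lst subcommands follow the pattern of the example in the Lustre manual. The function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) an LNet selftest command sequence
# for a bulk write test from one NID to another. Names such as
# "mod_test", "clients", "servers" and "bulk" are arbitrary labels.
selftest_plan() {
    from=$1
    to=$2
    size=${3:-1M}
    cat <<EOF
export LST_SESSION=\$\$
lst new_session mod_test
lst add_group clients $from
lst add_group servers $to
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=$size
lst run bulk
lst stat servers & sleep 30; kill \$!
lst end_session
EOF
}

# The two NIDs used in this thread.
selftest_plan 102.88.88.188@o2ib 102.88.88.184@o2ib
```

Running the same plan once per candidate map_on_demand value, with ko2iblnd reloaded in between, gives comparable numbers for choosing a setting.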
Vipul Pandya
2010-Feb-26  08:56 UTC
[Lustre-discuss] Lustre-1.8.1.1 over o2ib gives Input/Output error while executing lctl ping
Hi Isaac,

This was very helpful. Thanks a lot for your response.

Vipul