Dear Lustre group,

I'm hoping you can help with this problem. My configuration is as follows: 4 OSSes | 1 MDS/MGS | n client nodes.

RPMs installed on the CentOS 5.2 systems:

lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp
kernel-ib-1.3.1-2.6.18_92.1.10.el5_lustre.1.6.6smp
lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp
kernel-lustre-smp-2.6.18-92.1.10.el5_lustre.1.6.6

I'm able to start Lustre on all OSSes and the MDS/MGS and mount it on the clients successfully, but eventually the Lustre mount hangs (df hangs) on the clients. Initially I thought it might be a fabric problem with IB, but I see no errors on the switch and all cables are attached securely. The hangs are very random: some nodes stay up for days and some hang after a couple of hours, but inevitably all nodes hang.

When a node hangs, it is unable to do an lctl ping to an OSS. For example, node-0-6 is hanging. From this node I can do an lctl ping to oss-0-0, oss-0-2 and oss-0-3; lctl ping to oss-0-1 just hangs. And if I do the same from oss-0-1 to node-0-6, I get the following error message:

[root@tiger-oss-0-1 ~]# lctl ping 192.255.255.220@o2ib
failed to ping 192.255.255.220@o2ib: Input/output error

Interestingly enough, the OSS can ping any other node:

[root@tiger-oss-0-1 ~]# lctl ping 192.255.255.222@o2ib
12345-0@lo
12345-192.255.255.222@o2ib

And the node can ping any other system:

[root@tiger-node-0-6 ~]# lctl ping 192.255.255.253@o2ib
12345-0@lo
12345-192.255.255.253@o2ib

Only the communication between the two is broken.

The only messages from oss-0-1 related to node-0-6 are these:

[root@tiger-oss-0-1 ~]# cat /var/log/messages | grep 192.255.255.220
Mar 26 04:22:26 tiger-oss-0-1 kernel: Lustre: lustre-OST0008: haven't heard from client d17b6a66-9ba9-18a9-e706-8fa35ad18119 (at 192.255.255.220@o2ib) in 227 seconds. I think it's dead, and I am evicting it.
Mar 26 04:22:26 tiger-oss-0-1 kernel: Lustre: lustre-OST000d: haven't heard from client d17b6a66-9ba9-18a9-e706-8fa35ad18119 (at 192.255.255.220@o2ib) in 227 seconds. I think it's dead, and I am evicting it.

Messages from node-0-6 related to oss-0-1:

[root@tiger-node-0-6 ~]# cat /var/log/messages | grep 192.255.255.252
Mar 25 18:36:01 tiger-node-0-6 kernel: LustreError: 4482:0:(o2iblnd_cb.c:2891:kiblnd_check_conns()) Timed out RDMA with 192.255.255.252@o2ib
Mar 25 18:36:01 tiger-node-0-6 kernel: LustreError: 4482:0:(events.c:66:request_out_callback()) @@@ type 4, status -103 req@ffff81006ea43a00 x2646/t0 o400->lustre-OST000f_UUID@192.255.255.252@o2ib:28/4 lens 128/256 e 0 to 100 dl 1238020602 ref 2 fl Rpc:N/0/0 rc 0/0
Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre: Request x2644 sent from lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 59s ago has timed out (limit 100s).
Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre: lustre-OST000d-osc-ffff81007f555000: Connection to service lustre-OST000d via nid 192.255.255.252@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre: lustre-OST000e-osc-ffff81007f555000: Connection restored to service lustre-OST000e using nid 192.255.255.252@o2ib.
Mar 25 18:36:01 tiger-node-0-6 kernel: Lustre: lustre-OST000d-osc-ffff81007f555000: Connection restored to service lustre-OST000d using nid 192.255.255.252@o2ib.
Mar 26 04:19:35 tiger-node-0-6 kernel: LustreError: 4482:0:(o2iblnd_cb.c:2891:kiblnd_check_conns()) Timed out RDMA with 192.255.255.252@o2ib
Mar 26 04:20:20 tiger-node-0-6 kernel: Lustre: Request x50261 sent from lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago has timed out (limit 100s).
Mar 26 04:20:20 tiger-node-0-6 kernel: Lustre: lustre-OST000d-osc-ffff81007f555000: Connection to service lustre-OST000d via nid 192.255.255.252@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar 26 04:20:44 tiger-node-0-6 kernel: Lustre: Request x50290 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 04:20:44 tiger-node-0-6 kernel: Lustre: lustre-OST0008-osc-ffff81007f555000: Connection to service lustre-OST0008 via nid 192.255.255.252@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Mar 26 04:21:10 tiger-node-0-6 kernel: Lustre: Request x50324 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago has timed out (limit 100s).
Mar 26 04:21:35 tiger-node-0-6 kernel: Lustre: Request x50358 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago has timed out (limit 100s).
Mar 26 04:22:00 tiger-node-0-6 kernel: Lustre: Request x50416 sent from lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 04:22:24 tiger-node-0-6 kernel: Lustre: Request x50429 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 04:24:04 tiger-node-0-6 kernel: Lustre: Request x50538 sent from lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 04:24:30 tiger-node-0-6 kernel: Lustre: Request x50567 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 101s ago has timed out (limit 100s).
Mar 26 04:26:09 tiger-node-0-6 kernel: Lustre: Request x50676 sent from lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 04:28:14 tiger-node-0-6 kernel: Lustre: Request x50814 sent from lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 04:30:44 tiger-node-0-6 kernel: Lustre: Request x50981 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 04:36:33 tiger-node-0-6 kernel: Lustre: Request x51366 sent from lustre-OST000d-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 04:45:18 tiger-node-0-6 kernel: Lustre: Request x51947 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 04:55:43 tiger-node-0-6 kernel: Lustre: Request x52637 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 05:06:08 tiger-node-0-6 kernel: Lustre: Request x53327 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 05:16:33 tiger-node-0-6 kernel: Lustre: Request x54017 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 05:26:58 tiger-node-0-6 kernel: Lustre: Request x54707 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 05:37:23 tiger-node-0-6 kernel: Lustre: Request x55397 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 05:47:47 tiger-node-0-6 kernel: Lustre: Request x56087 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 05:58:12 tiger-node-0-6 kernel: Lustre: Request x56777 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 06:08:37 tiger-node-0-6 kernel: Lustre: Request x57467 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 06:19:02 tiger-node-0-6 kernel: Lustre: Request x58157 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 06:29:27 tiger-node-0-6 kernel: Lustre: Request x58847 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 06:39:52 tiger-node-0-6 kernel: Lustre: Request x59537 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 06:50:17 tiger-node-0-6 kernel: Lustre: Request x60227 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 07:00:41 tiger-node-0-6 kernel: Lustre: Request x60917 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 07:11:06 tiger-node-0-6 kernel: Lustre: Request x61607 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 07:21:31 tiger-node-0-6 kernel: Lustre: Request x62271 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 07:31:56 tiger-node-0-6 kernel: Lustre: Request x62961 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 07:42:21 tiger-node-0-6 kernel: Lustre: Request x63651 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 07:52:46 tiger-node-0-6 kernel: Lustre: Request x64341 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 08:03:11 tiger-node-0-6 kernel: Lustre: Request x65031 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 08:13:36 tiger-node-0-6 kernel: Lustre: Request x65721 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 08:24:00 tiger-node-0-6 kernel: Lustre: Request x66411 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 08:34:25 tiger-node-0-6 kernel: Lustre: Request x67101 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 08:44:50 tiger-node-0-6 kernel: Lustre: Request x67791 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 08:55:15 tiger-node-0-6 kernel: Lustre: Request x68481 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 09:05:40 tiger-node-0-6 kernel: Lustre: Request x69171 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 09:16:05 tiger-node-0-6 kernel: Lustre: Request x69861 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 09:26:30 tiger-node-0-6 kernel: Lustre: Request x70551 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 09:36:54 tiger-node-0-6 kernel: Lustre: Request x71249 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 09:38:43 tiger-node-0-6 kernel: Lustre: 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late network completion
Mar 26 09:39:43 tiger-node-0-6 kernel: Lustre: 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late network completion
Mar 26 09:40:43 tiger-node-0-6 kernel: Lustre: 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late network completion
Mar 26 09:41:43 tiger-node-0-6 kernel: Lustre: 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late network completion
Mar 26 09:42:43 tiger-node-0-6 kernel: Lustre: 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late network completion
Mar 26 09:47:19 tiger-node-0-6 kernel: Lustre: Request x71939 sent from lustre-OST0008-osc-ffff81007f555000 to NID 192.255.255.252@o2ib 100s ago has timed out (limit 100s).
Mar 26 09:48:43 tiger-node-0-6 kernel: Lustre: 8821:0:(api-ni.c:1685:lnet_ping()) ping 12345-192.255.255.252@o2ib: late network completion
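A quick way to map out exactly which peer pairs have broken LNET connectivity is to sweep lctl ping across every server NID from the hung client, and then repeat the sweep from each server back toward the client's NID (lctl list_nids prints a node's own NIDs). A minimal sketch; the NID list below uses only addresses quoted in this thread and is a placeholder for your real server list:

# Run on the hung client; substitute the NIDs reported by "lctl list_nids" on your servers.
# Note: a ping to a broken peer may block for the full LNET timeout before failing.
for nid in 192.255.255.252@o2ib 192.255.255.253@o2ib 192.255.255.254@o2ib; do
    echo "=== $nid ==="
    lctl ping "$nid" || echo "LNET ping to $nid failed"
done

A ping that hangs or returns "Input/output error" in only one direction, as between node-0-6 and oss-0-1 above, narrows the fault to that specific path through the fabric rather than to either node as a whole.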
Brian J. Murrell
2009-Mar-31 14:48 UTC
[Lustre-discuss] Client evictions and RDMA failures
On Tue, 2009-03-31 at 10:29 -0400, syed haider wrote:
> When a node hangs, it is unable to do an lctl ping to an OSS. For example, node-0-6
> is hanging. From this node I can do an lctl ping to oss-0-0, oss-0-2 and oss-0-3;
> lctl ping to oss-0-1 just hangs. And if I do the same from oss-0-1 to node-0-6, I get
> the following error message:
>
> [root@tiger-oss-0-1 ~]# lctl ping 192.255.255.220@o2ib
> failed to ping 192.255.255.220@o2ib: Input/output error

That, together with all of the log messages you posted, looks an awful lot like a networking problem. You need to find some independent method of testing network connectivity when this happens. I think there are tools in the OFED distribution to test I/B networks. Make sure that whatever test/tool you use exercises RDMA: there are several communication channels in an I/B connection, LNET uses the RDMA channel, and an ICMP ping on an I/B network is not an indicator that LNET will be happy with that network.

Just because a piece of networking gear fails to report any errors would not for a minute make me believe there are none. Only an empirical test would do that for me.

Cheers,
b.
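As a concrete example of the RDMA-level test Brian describes, the perftest utilities shipped with OFED (ib_write_bw, ib_read_bw) move data over the same RDMA channel that LNET's o2iblnd uses. A minimal sketch, assuming perftest is installed on both nodes and using this thread's addresses as placeholders; exact options vary between OFED releases:

# On the suspect OSS, start the listening side of the bandwidth test:
[root@tiger-oss-0-1 ~]# ib_write_bw

# On the hung client, point the test at the OSS's IPoIB address:
[root@tiger-node-0-6 ~]# ib_write_bw 192.255.255.252

If the transfer between the failing pair stalls or errors out while the same test between other pairs runs at full rate, that is the kind of empirical, RDMA-level evidence of a fabric fault that ICMP ping and an error-free switch display will not show.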
Hi Brian,

Thanks for the response. I've run a few IB tests and here is an interesting result on the port for a failed node:

[root@tiger-node-0-1 ~]# ibqueryerrors.pl -c -a -r
Suppressing: RcvSwRelayErrors
Errors for 0x0008f104003f0e21 "ISR9288/ISR9096 Voltaire sLB-24"
   GUID 0x0008f104003f0e21 port 23: [XmtDiscards == 4]
         Actions:
          XmtDiscards: This is a symptom of congestion and may require tweaking either HOQ or switch lifetime values
         Link info:      5   23[20]  ==( 4X 2.5 Gbps)==>  0x0008f10403970e20    1[  ] "tiger-node-0-11 HCA-1"
[root@tiger-node-0-1 ~]#

This is interesting because other sources state that my problem is possibly related to an over-subscribed network, even though there is no traffic on the network when these nodes hang. Are you familiar with what settings need to be tweaked on a Voltaire IB switch (9550) to possibly resolve this problem? Unfortunately, my knowledge of IB is minimal; any help is appreciated. Thanks!

On Tue, Mar 31, 2009 at 10:48 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Tue, 2009-03-31 at 10:29 -0400, syed haider wrote:
>> When a node hangs, it is unable to do an lctl ping to an OSS. For example, node-0-6
>> is hanging. From this node I can do an lctl ping to oss-0-0, oss-0-2 and oss-0-3;
>> lctl ping to oss-0-1 just hangs. And if I do the same from oss-0-1 to node-0-6, I get
>> the following error message:
>>
>> [root@tiger-oss-0-1 ~]# lctl ping 192.255.255.220@o2ib
>> failed to ping 192.255.255.220@o2ib: Input/output error
>
> That, together with all of the log messages you posted, looks an awful lot like a
> networking problem. You need to find some independent method of testing network
> connectivity when this happens. I think there are tools in the OFED distribution to
> test I/B networks. Make sure that whatever test/tool you use exercises RDMA: there
> are several communication channels in an I/B connection, LNET uses the RDMA channel,
> and an ICMP ping on an I/B network is not an indicator that LNET will be happy with
> that network.
>
> Just because a piece of networking gear fails to report any errors would not for a
> minute make me believe there are none. Only an empirical test would do that for me.
>
> Cheers,
> b.
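One way to tell whether those XmtDiscards are stale history or are still accumulating is to clear the port counters and re-read them after the next hang. A rough sketch using the standard infiniband-diags tools, assuming the "5" in the Link info line above is the switch LID, 23 is the port facing node-0-11, and your perfquery supports reset-after-read (-r):

# Read the counters on switch LID 5, port 23, and reset them afterwards
perfquery -r 5 23

# ...wait for the next client hang, then read them again...
perfquery 5 23

# and re-run the fabric-wide error sweep for comparison
ibqueryerrors.pl -c -a -r

If XmtDiscards (or symbol/link errors) keep climbing on an otherwise idle fabric, that points at a bad link, port or HCA rather than genuine congestion from an over-subscribed network.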
Brian J. Murrell
2009-Mar-31 17:32 UTC
[Lustre-discuss] Client evictions and RDMA failures
On Tue, 2009-03-31 at 11:38 -0400, syed haider wrote:
> Hi Brian,

Hi.

> Thanks for the response. I've run a few IB tests and here is an
> interesting result on the port for a failed node:
>
> [root@tiger-node-0-1 ~]# ibqueryerrors.pl -c -a -r
> Suppressing: RcvSwRelayErrors
> Errors for 0x0008f104003f0e21 "ISR9288/ISR9096 Voltaire sLB-24"
>    GUID 0x0008f104003f0e21 port 23: [XmtDiscards == 4]
>          Actions:
>           XmtDiscards: This is a symptom of congestion and may require
> tweaking either HOQ or switch lifetime values
>
>          Link info:      5   23[20]  ==( 4X 2.5 Gbps)==>
> 0x0008f10403970e20    1[  ] "tiger-node-0-11 HCA-1"

FWIW, I have absolutely no idea what any of this means.

> This is interesting because other sources state that my problem is
> possibly related to an over-subscribed network even though there is no
> traffic on the network when these nodes hang. Are you familiar with
> what settings need to be tweaked on a Voltaire IB switch (9550) to
> possibly resolve this problem?

Not at all. Probably there are other lists out there with specific I/B experts to help. You might also try going right back to your I/B vendor.

b.
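For what it's worth, when escalating this kind of problem to the switch vendor it usually helps to attach a full fabric sweep rather than a single port query. A hedged sketch with the generic OFED diagnostic tools (option names and output locations vary by version; older ibdiagnet releases write their reports under /tmp):

# Fabric-wide diagnostic sweep
ibdiagnet

# Subnet manager state and full topology, to send along with the report
sminfo
ibnetdiscover > fabric-topology.txt

These are generic OFED utilities, not Voltaire-specific commands; the HOQ/switch-lifetime settings mentioned by ibqueryerrors would still have to be changed through the switch's own management interface.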
Thanks, Brian. On one of the hung nodes I unmounted Lustre, removed the Lustre modules (rmmod), reloaded them and mounted Lustre again. The mount hangs again, but I now see 16 OSTs in the "ST" state; the same OSTs are also listed in the "UP" state:

  0 UP mgc MGC192.255.255.254@o2ib bf0dec15-659a-5817-6c78-0d43ca25e7c9 5
  1 UP lov lustre-clilov-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 4
  2 UP mdc lustre-MDT0000-mdc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
  3 UP osc lustre-OST0000-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
  4 UP osc lustre-OST0001-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
  5 UP osc lustre-OST0002-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
  6 UP osc lustre-OST0003-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
  7 UP osc lustre-OST0004-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
  8 UP osc lustre-OST0005-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
  9 UP osc lustre-OST0006-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 10 UP osc lustre-OST0007-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 11 UP osc lustre-OST0008-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 12 UP osc lustre-OST0009-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 13 UP osc lustre-OST000a-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 14 UP osc lustre-OST000b-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 15 UP osc lustre-OST000c-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 16 UP osc lustre-OST000d-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 17 UP osc lustre-OST000e-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 18 UP osc lustre-OST000f-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 19 ST osc lustre-OST0010-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 20 ST osc lustre-OST0011-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 21 ST osc lustre-OST0012-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 22 ST osc lustre-OST0013-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 23 ST osc lustre-OST0014-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 24 ST osc lustre-OST0015-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 25 ST osc lustre-OST0016-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 26 ST osc lustre-OST0017-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 27 ST osc lustre-OST0018-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 28 ST osc lustre-OST0019-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 29 ST osc lustre-OST001a-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 30 ST osc lustre-OST001b-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 31 ST osc lustre-OST001c-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 32 ST osc lustre-OST001d-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 33 ST osc lustre-OST001e-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 34 ST osc lustre-OST001f-osc-ffff8100b9519400 3d28a9b3-d7c7-2846-e634-43bdce74f96a 2
 35 UP osc lustre-OST0010-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 36 UP osc lustre-OST0011-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 37 UP osc lustre-OST0012-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 38 UP osc lustre-OST0013-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 39 UP osc lustre-OST0014-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 40 UP osc lustre-OST0015-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 41 UP osc lustre-OST0016-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 42 UP osc lustre-OST0017-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 43 UP osc lustre-OST0018-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 44 UP osc lustre-OST0019-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 45 UP osc lustre-OST001a-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 46 UP osc lustre-OST001b-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 47 UP osc lustre-OST001c-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 48 UP osc lustre-OST001d-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 49 UP osc lustre-OST001e-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5
 50 UP osc lustre-OST001f-osc-ffff8100af56d000 bfab1519-8ef0-1b82-a0dd-ee82577481cc 5

What would cause this? Could this be because of the fabric also?

On Tue, Mar 31, 2009 at 1:32 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Tue, 2009-03-31 at 11:38 -0400, syed haider wrote:
>> Hi Brian,
>
> Hi.
>
>> Thanks for the response. I've run a few IB tests and here is an
>> interesting result on the port for a failed node:
>>
>> [root@tiger-node-0-1 ~]# ibqueryerrors.pl -c -a -r
>> Suppressing: RcvSwRelayErrors
>> Errors for 0x0008f104003f0e21 "ISR9288/ISR9096 Voltaire sLB-24"
>>    GUID 0x0008f104003f0e21 port 23: [XmtDiscards == 4]
>>          Actions:
>>           XmtDiscards: This is a symptom of congestion and may require
>> tweaking either HOQ or switch lifetime values
>>
>>          Link info:      5   23[20]  ==( 4X 2.5 Gbps)==>
>> 0x0008f10403970e20    1[  ] "tiger-node-0-11 HCA-1"
>
> FWIW, I have absolutely no idea what any of this means.
>
>> This is interesting because other sources state that my problem is
>> possibly related to an over-subscribed network even though there is no
>> traffic on the network when these nodes hang. Are you familiar with
>> what settings need to be tweaked on a Voltaire IB switch (9550) to
>> possibly resolve this problem?
>
> Not at all. Probably there are other lists out there with specific I/B
> experts to help. You might also try going right back to your I/B
> vendor.
>
> b.
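The "ST" entries look like stale OSC devices left over from the previous, partially cleaned-up mount: note that they hang off the old instance ffff8100b9519400, while the devices belonging to the new mount reference ffff8100af56d000. A minimal sketch of how one might inspect and clear them on a 1.6-era client, assuming the node can be taken out of service and that the filesystem is mounted at /mnt/lustre (a placeholder; substitute your mount point):

# List just the stale devices
lctl dl | awk '$2 == "ST"'

# Unmount (forcibly if needed) and unload all Lustre/LNET modules so the
# stale devices are torn down; the lustre_rmmod script ships with the lustre RPMs
umount -f /mnt/lustre
lustre_rmmod

If the modules refuse to unload because references are still held, rebooting the client is usually the pragmatic way out. Either way, the stale devices are a symptom of the earlier evictions and aborted cleanup rather than the root cause of the hangs.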
Brian J. Murrell
2009-Mar-31 20:29 UTC
[Lustre-discuss] Client evictions and RDMA failures
On Tue, 2009-03-31 at 16:02 -0400, syed haider wrote:
> What would cause this? Could this be because of the fabric also?

Sure. When the fabric is flaky all sorts of unexpected things (can) happen. Really, your primary task should be making your network stable rather than continuing to muck with Lustre on it.

b.
Thanks for your help, Brian. We've resolved the problem by upgrading the firmware on the HCAs from 1.0.7 to 1.2.0; the mounts have stabilized. We also upgraded to OFED 1.4 (minus the kernel-ib patches).

On Tue, Mar 31, 2009 at 4:29 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Tue, 2009-03-31 at 16:02 -0400, syed haider wrote:
>> What would cause this? Could this be because of the fabric also?
>
> Sure. When the fabric is flaky all sorts of unexpected things (can)
> happen. Really, your primary task should be making your network stable
> rather than continuing to muck with Lustre on it.
>
> b.
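For anyone chasing the same symptom, the HCA firmware level that mattered here can be read on each node with the standard OFED tools, for example:

# Firmware version as seen by the verbs layer (fw_ver field)
ibv_devinfo | grep fw_ver

# ibstat reports the same information per channel adapter
ibstat | grep -i "firmware version"

Running one of these across all clients and servers (e.g. via pdsh or a simple ssh loop) is a quick way to confirm that no node was missed during the upgrade.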