Hello,

I've noticed that Lustre network errors, especially LND errors, are considered maskable errors. That means that on a production node, where the debug mask is 0, those specific errors won't be displayed if they happen.

Does that mean that they are harmless? Do upper layers resend their RPCs/packets if LNDs report an error?

In my case, o2iblnd says something like "RDMA failed" (neterror). Is that a big issue? Were some RPCs lost or not?

Thanks in advance

-- 
Aurelien Degremont
Alexey Lyashkov
2010-Sep-22 18:20 UTC
[Lustre-discuss] [Lustre-devel] Meaning of LND/neterrors ?
Hi Aurelien,

You can see that message in two cases:
1) A low-level network error. That is bad, because the client will reconnect and resend its requests after the error, which adds extra load to the service nodes.
2) A service node (MDS, OSS) is restarted or hung; in that case the transfer is aborted.

On Sep 22, 2010, at 19:20, Aurelien Degremont wrote:
> I've noticed that Lustre network errors, especially LND errors, are considered maskable errors.
> [...]
> When, in my case, o2iblnd says something like "RDMA failed" (neterror), is it a big issue? Were some RPCs lost or not?
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
Hi,

Eric Barton wrote:
> It's expected that peers will crash and therefore the low-level
> network should not clutter the logs with noise and the upper
> layers should handle the problem by retrying or doing actual
> recovery.

Ok, so I can understand those errors as something like:
- my IB network is not so clean
- but Lustre upper layers will retry, so this is transparent to them as long as I do not have too many issues of this kind.

> "RDMA failed" should really only occur when a peer node crashes.
> However it could be a sign that there are deeper problems with
> the network setup or hardware.

Ok, but in my case we have issues where nodes do not crash and we still get messages like these (they occur on LNET routers):

Tx -> ... cookie ... sending 1 waiting 0: failed 12
Closing conn to ... : error -5 (waiting)

even though the corresponding node is responding and Lustre works for it.

> If you suspect the network is
> misbehaving, I'd run an LNET self-test. This is well documented
> in the manual (at least to people who already know how it works ;)
> and lets you soak-test the network from any convenient node.

Ok :) I use it often, so that's ok.
But lnet_selftest has difficulty working nicely if you're using different OFED stacks (at least v1.4.2 against v1.5.1), so it is difficult to use it as a test for my current issue.

Thanks

Aurélien

-- 
Aurelien Degremont
CEA
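For reference, the trailing numbers in those LNet console messages ("failed 12", "error -5") are ordinary Linux errno values, sometimes negated. A small sketch to decode the two codes quoted above (the log lines themselves are abbreviated and the NIDs elided):

```python
import errno
import os

# LNet/o2iblnd console messages report (possibly negated) kernel errno
# values. Decode the two codes seen in the router logs above.
for code in (12, -5):
    err = abs(code)
    print(f"{code}: {errno.errorcode[err]} ({os.strerror(err)})")
# -> 12 is ENOMEM and -5 is EIO
```

So "failed 12" suggests an allocation failure on the transmit path, while "error -5" is a generic I/O error reported when the connection is torn down.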
Aurelien Degremont
2010-Sep-23 07:59 UTC
[Lustre-discuss] [Lustre-devel] Meaning of LND/neterrors ?
Alexey Lyashkov wrote:
> That message you can see in two cases:
> 1) A low-level network error. That is bad, because the client will reconnect and resend its requests after the error, which adds extra load to the service nodes.
> 2) A service node (MDS, OSS) is restarted or hung; in that case the transfer is aborted.

In our case the nodes were not restarted, so the InfiniBand network seems to have issues. But these errors can be ignored as long as they do not appear too often.

-- 
Aurelien Degremont
CEA
Aurelien,

Could you give us some details about those difficulties with lnet_selftest over different OFED stacks when you see them again? It would be interesting to know, because I think lnet_selftest should be stack independent.

Thanks

Liang

On 9/23/10 3:57 PM, Aurelien Degremont wrote:
>> If you suspect the network is
>> misbehaving, I'd run an LNET self-test. This is well documented
>> in the manual (at least to people who already know how it works ;)
>> and lets you soak-test the network from any convenient node.
>
> Ok :) I use it often, so that's ok.
> But lnet_selftest has difficulty working nicely if you're using different OFED stacks (at least v1.4.2 against v1.5.1).
> So it is difficult to use it as a test for my current issue.

-- 
Cheers
Liang
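For anyone reading this thread later, a minimal lnet_selftest soak-test session driven from a console node looks roughly like the following. This is a sketch based on the `lst` utility described in the Lustre manual; the group names and the `@o2ib` NIDs are hypothetical placeholders to be replaced with your own, and the exact syntax should be checked against the manual for your Lustre version:

```shell
# Prerequisite on the console node and all test nodes:
#   modprobe lnet_selftest

export LST_SESSION=$$            # session identifier shared by all lst commands

lst new_session read_write       # start a new self-test session
lst add_group clients 192.168.1.[10-11]@o2ib   # hypothetical client NIDs
lst add_group servers 192.168.1.[20-21]@o2ib   # hypothetical server NIDs

lst add_batch bulk_rw
# 1M bulk reads and writes between the two groups:
lst add_test --batch bulk_rw --from clients --to servers brw read size=1M
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M

lst run bulk_rw
lst stat clients servers         # watch throughput/error counters; Ctrl-C to stop
lst end_session
```

Running such a batch for an extended period while watching the error counters in `lst stat` is one way to stress the fabric and surface the kind of intermittent failures discussed above.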
Aurelien,

> Eric Barton wrote:
>> It's expected that peers will crash and therefore the low-level
>> network should not clutter the logs with noise and the upper
>> layers should handle the problem by retrying or doing actual
>> recovery.
>
> Ok, so I can understand those errors as something like:
> - my IB network is not so clean
> - but Lustre upper layers will retry, so this is transparent to them
>   as long as I do not have too many issues of this kind.
>
>> "RDMA failed" should really only occur when a peer node crashes.
>> However it could be a sign that there are deeper problems with
>> the network setup or hardware.
>
> Ok, but in my case we have issues where nodes do not crash and we still get messages like these
> (they occur on LNET routers):
> Tx -> ... cookie ... sending 1 waiting 0: failed 12
> Closing conn to ... : error -5 (waiting)
>
> even though the corresponding node is responding and Lustre works for it.

Then I'd suspect the IB network (switches and cabling). If I were you, I'd really want to root these problems out. While they persist, Lustre can evict clients spuriously and clients may appear to hang for many seconds at a time.

>> If you suspect the network is
>> misbehaving, I'd run an LNET self-test. This is well documented
>> in the manual (at least to people who already know how it works ;)
>> and lets you soak-test the network from any convenient node.
>
> Ok :) I use it often, so that's ok.
> But lnet_selftest has difficulty working nicely if you're using different OFED stacks (at least
> v1.4.2 against v1.5.1).
> So it is difficult to use it as a test for my current issue.

Hmm - LNet self-test doesn't care at all what the underlying networks are, so if networking breaks when you're using different OFED stacks, I'd suspect the real problem is that OFED version interoperation doesn't work when the network is under stress. I'm not clear what guarantees on version interoperation (if any) OFED makes, and even if it's supposed to work, it could easily be buggy.

Cheers,
Eric