Hi,

Can anyone share their experiences using OFED 1.5.1 on Lustre clients? This is needed because RHAT 5.5 does not support OFED 1.4.2.

Is there an effort underway at Oracle to qualify OFED 1.5.1?

Thanks.

Staff Engineer
Terascala, Inc.
508-588-1501
www.terascala.com
On Jun 17, 2010, at 1:23 PM, Roger Spellman wrote:

> Hi,
> Can anyone share their experiences using OFED 1.5.1 on Lustre
> Clients? This is needed because RHAT 5.5 does not support OFED 1.4.2.

We're using it with ~9300 clients running Lustre 1.6.6 and haven't identified any OFED 1.5.1-specific issues. If you're using 1.6.x and haven't done so already, you'll want to apply attachment 23498 from bug 19520.

We just deployed 1.8.2 on a separate cluster with ~130 clients and haven't seen any OFED-specific issues there, either. While we did see some failures when running acc-sm with the stack of software we use here, none of those had anything to do with the version of OFED we were running.

We're using SLES10 SP{2,3} with OFED bits built by SGI, not what's included in the distro. Your mileage, of course, may vary :-)

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
Jason,

Thanks for this response. This brings up another question: the bug number you referred to mentions an LBUG in OFED 1.4.1. Are you saying that the same LBUG would occur with OFED 1.5.1 too without the patch?

-Roger
On Jun 18, 2010, at 7:49 AM, Roger Spellman wrote:

> Jason,
> Thanks for this response. This brings up another question:

np

> The bug number you referred to mentions an LBUG in OFED 1.4.1. Are you
> saying that the same LBUG would occur with OFED 1.5.1 too without the
> patch?

Yes. The patch handles new RDMA CM events that appear in OFED 1.4(.1?). They are also in 1.5.1. Without the patch, receipt of one of those events will result in an LBUG.

Jason
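The failure mode described in this reply (a handler that treats any event it does not recognize as fatal, running on a newer stack that delivers additional events) can be illustrated with a minimal sketch. The event names, the `EV_FATAL` result that stands in for LBUG(), and both handler functions are hypothetical; they are not the real RDMA CM event enum or the actual o2iblnd code.

```c
#include <stdio.h>

/* Hypothetical event codes for illustration only; NOT the rdma_cm enum. */
enum ev_type {
    EV_CONNECT_REQUEST = 0,
    EV_ESTABLISHED     = 1,
    EV_DISCONNECTED    = 2,
    EV_ADDED_LATER     = 3   /* assumed: introduced by a newer stack */
};

enum ev_result { EV_HANDLED, EV_FATAL };

/* Handler written before EV_ADDED_LATER existed: any unknown code
 * reaches the default branch, which stands in for LBUG(). */
static enum ev_result handle_unpatched(int ev)
{
    switch (ev) {
    case EV_CONNECT_REQUEST:
    case EV_ESTABLISHED:
    case EV_DISCONNECTED:
        return EV_HANDLED;
    default:
        fprintf(stderr, "Unexpected event: %d\n", ev);
        return EV_FATAL;
    }
}

/* Patched handler: the new event gets an explicit case, while codes
 * that are still unknown keep the fail-fast default. */
static enum ev_result handle_patched(int ev)
{
    switch (ev) {
    case EV_ADDED_LATER:
        return EV_HANDLED;
    default:
        return handle_unpatched(ev);
    }
}
```

Calling `handle_unpatched(EV_ADDED_LATER)` takes the fatal branch, while `handle_patched(EV_ADDED_LATER)` does not; an unrelated code such as 99 is still fatal in both.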
Jason (or anyone else),

Patch 23498 ( https://bugzilla.lustre.org/attachment.cgi?id=23498 ) says:

    Index: ./lnet/klnds/o2iblnd/o2iblnd_cb.c
    ===================================================================
    RCS file: /cvsroot/cfs/lnet/klnds/o2iblnd/o2iblnd_cb.c,v
    retrieving revision 1.12.6.1.2.5
    diff -u -p -u -p -r1.12.6.1.2.5 o2iblnd_cb.c
    --- ./lnet/klnds/o2iblnd/o2iblnd_cb.c 20 Nov 2008 09:29:34 -0000 1.12.6.1.2.5
    +++ ./lnet/klnds/o2iblnd/o2iblnd_cb.c 15 May 2009 12:26:07 -0000
    @@ -2654,6 +2654,8 @@ kiblnd_cm_callback(struct rdma_cm_id *cm

             switch (event->event) {
             default:
    +                CERROR("Unexpected event: %d, status: %d\n",
    +                       event->event, event->status);
                     LBUG();

Why should we LBUG just for an unexpected event? Couldn't it just be ignored?

-Roger
Since the event is unknown, it is hard to know in advance whether it can be ignored or not. Some protocols encode in the message type whether it is 'mandatory' or 'optional' to handle; others, as Lustre does, negotiate in advance which operations are understood and never send unknown requests to peers. I have no idea whether IB does this or not.

In the absence of such information, the safest behaviour (if not the most robust) is to fail, since the unknown event may be critical to the correct behaviour of the system.

Cheers, Andreas
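The two approaches mentioned in this reply, marking message types as mandatory or optional and negotiating supported operations up front, can be sketched roughly as follows. Everything here is an assumption made for illustration: the bit layout, the set of known codes (1 to 3), and all function names are hypothetical and do not describe Lustre's or InfiniBand's actual wire format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical convention: the top bit of a 16-bit message type says the
 * receiver must understand the message; the low 15 bits are the code. */
#define MSG_MANDATORY_BIT 0x8000u
#define MSG_CODE_MASK     0x7FFFu

enum msg_action { MSG_PROCESS, MSG_SKIP, MSG_FAIL };

/* Assumed for the sketch: this receiver understands codes 1 through 3. */
static bool msg_code_known(uint16_t code)
{
    return code >= 1 && code <= 3;
}

/* Known messages are processed; unknown optional ones are skipped;
 * unknown mandatory ones are a hard failure. */
static enum msg_action msg_classify(uint16_t type)
{
    uint16_t code = (uint16_t)(type & MSG_CODE_MASK);

    if (msg_code_known(code))
        return MSG_PROCESS;
    return (type & MSG_MANDATORY_BIT) ? MSG_FAIL : MSG_SKIP;
}

/* Negotiation alternative: each peer advertises a bitmask of operations
 * it supports, and only the intersection may be sent afterwards. */
static uint32_t ops_negotiate(uint32_t mine, uint32_t theirs)
{
    return mine & theirs;
}

/* True if operation number op_bit (0..31) is in the agreed set. */
static bool ops_may_send(uint32_t agreed, unsigned op_bit)
{
    return op_bit < 32 && ((agreed >> op_bit) & 1u) != 0;
}
```

With either scheme the receiver never has to guess about an unknown message: it is either explicitly marked as safe to skip, or it is never sent. Without such information, the fail-fast default is the only option that cannot silently drop something critical.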