Hi list,

if a Lustre router is down, comes back to life, and the servers do not actively test the routers periodically: is it possible to mark a Lustre router as "up"? Or to tell the servers to ping the router?

Or can I enable the "router pinger" in a live system without unloading and loading the Lustre kernel modules?

Regards, Michael

--
Michael Kluge, M.Sc.
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
D-01062 Dresden, Germany
Contact: Willersbau, Room A 208
Phone: (+49) 351 463-34217
Fax: (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW: http://www.tu-dresden.de/zih
You'll want to add the "dead_router_check_interval" lnet module parameter as soon as you are able. As near as I can tell, without that there's no automatic check to make sure the router is alive.

I've had some success in getting machines to recognize that a router is alive again by doing an lctl ping of their side of a router (e.g., on a tcp0 client, `lctl ping <routerIP>@tcp0`, then `lctl ping <routerIP>@o2ib0` from an o2ib0 client). If you have a server/client version mismatch, where lctl ping returns a protocol error, you may be out of luck.

--
Mike Shuey
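For reference, a minimal sketch of how that module parameter is commonly set. The file name and the 60-second value are assumptions chosen for illustration, not values given in this thread, and the option only takes effect the next time the lnet module is loaded (which is why it has to wait for a maintenance window on a live system).

```
# /etc/modprobe.d/lustre.conf  (file name is an assumption; older systems may use /etc/modprobe.conf)
# Ping routers that are marked down every 60 seconds (illustrative value).
options lnet dead_router_check_interval=60
```

If the node already has an `options lnet ...` line (for example with `networks=` or `routes=`), the parameter would be appended to that existing line rather than added as a second one.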
I've found that even with the protocol error, it still works.

-Jason
Jason, Michael,

thanks a lot for your replies. I pinged everyone from all directions, but the router is still marked "down" on the client. I even removed and re-added the router entry via `lctl --net tcp1 del_route xyz@o2ib` and `lctl --net tcp1 add_route xyz@o2ib`. No luck. So I think I'll wait for the next maintenance window. Oh, and I forgot to mention that the servers run 1.6.7.2, as does the router, and the clients run 1.8.5. Works well so far.

Thanks, Michael
Jeremy Filizetti
2011-Jan-25 23:55 UTC
[Lustre-discuss] "up" a router that is marked "down"
Though I think it's marked as development or experimental in the Lustre documentation or source, "lctl set_route" has worked fine for me in the past with no issues.

lctl set_route <nid> up

is the syntax, I believe.

Jeremy
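The suggestion above can be sketched as a small shell helper that shows the routing table before and after the change, so the effect of `set_route` is visible. This is only an illustration: the function name `mark_router_up`, the `LCTL` override variable, and the example NID are hypothetical, and `set_route` should be used with care on a production system.

```shell
#!/bin/sh
# Sketch only: wraps the `lctl set_route <nid> up` call discussed in this thread.
# LCTL can be overridden (e.g. LCTL=echo) to preview the commands without running them.
LCTL="${LCTL:-lctl}"

# mark_router_up <gateway-nid>
# Shows the routing table, marks the gateway as up, then shows the table again.
mark_router_up() {
    gw="$1"
    if [ -z "$gw" ]; then
        echo "usage: mark_router_up <gateway-nid>" >&2
        return 2
    fi
    "$LCTL" show_route || return 1
    "$LCTL" set_route "$gw" up || return 1
    "$LCTL" show_route
}

# Example (the NID is a placeholder, not taken from the thread):
#   mark_router_up 192.168.1.10@o2ib
```

Running it once with `LCTL=echo` prints the three lctl invocations instead of executing them, which is a cheap way to check the NID before touching the routing table.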
Hi Jeremy,

yup, it's marked "==== obsolete (DANGEROUS) ====", whatever, it did the trick :)

Thanks a lot, Michael