We're running Lustre 1.8.7 on clients and servers.

We recently added an 11th OSS with 4 OSTs to our Lustre filesystem. Unfortunately, its modprobe.conf LNET line only listed an o2ib0(ib0) entry left over from testing; normally the line would look like:

   options lnet networks="o2ib0(ib0),tcp0(eth0),tcp1(eth2)"

for IB, 1Gbit and 10Gbit respectively.

As soon as the new OSTs on the 11th OSS were mounted and activated, our 1Gbit and 10Gbit clients kernel panicked; IB clients were fine. The 1Gbit and 10Gbit clients would then refuse to mount Lustre since they couldn't reach the OSS.

I unmounted the OSTs on that OSS, fixed the modprobe.conf line, rebooted, and ran

   tunefs.lustre --erase-param \
      --mgsnode=<ibaddr>@o2ib0,<gbitaddr>@tcp0,<10gbitaddr>@tcp1 \
      --writeconf /dev/sd{b,c,d,e}

where <xxxaddr> is the appropriate IP address.

That seemed to complete without issue, and tunefs reports:

   Parameters: mgsnode=<ibaddr>@o2ib0,<gbitaddr>@tcp0,<10gbitaddr>@tcp1

as expected. Unfortunately, 1Gbit and 10Gbit clients still refuse to mount Lustre:

   mount.lustre: mount <ipaddr>@tcp0:/lustre at /.lustre/mountpoint failed:
   No such file or directory
   Is the MGS specification correct?
   Is the filesystem name correct?
   If upgrading, is the copied client log valid? (see upgrade docs)

The OSS can ping clients on the 1Gbit and 10Gbit networks, so routing and networking are fine.

I'm sure I'm simply panicked and missing something obvious. What is the proper procedure to fix this mess? I thought the tunefs.lustre would do it, but it has not.

James Robnett
NRAO/AOC
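For completeness, a quick sanity check like the following should confirm the repaired OSS really is on all three LNET networks after the module options are fixed (the NIDs shown are placeholders for our actual addresses):

   # On the repaired OSS, after reloading the lnet module:
   lctl list_nids
      <ibaddr>@o2ib0
      <gbitaddr>@tcp0
      <10gbitaddr>@tcp1

   # From a 1Gbit client, check LNET-level reachability:
   lctl ping <gbitaddr>@tcp0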
A bit more information. The clients panicked when the first OST on the new OSS was added; that's OST0028. They now complain about reaching OST0028 when remounting Lustre.

You can see in the logs below that the OSS still thinks recovery for this client should be done over IB. Specifically the line:

   Jul 31 10:08:20 apathy kernel: Lustre: cmd=cf003 0:lustre-OST0028-osc 1:lustre-OST0028_UUID 2:<ibaddr>@o2ib

That's the IB interface for that OST. I suspect I have to unmount that OST, clear its logs, and remount. Unfortunately I only have the most basic understanding of that procedure. If that's the right procedure and somebody has the proper syntax, I'm all ears.

James

Jul 31 10:08:20 apathy kernel: Lustre: MGC10.64.1.161@tcp: Reactivating import
Jul 31 10:08:20 apathy kernel: LustreError: 5023:0:(ldlm_lib.c:331:client_obd_setup()) can't add initial connection
Jul 31 10:08:20 apathy kernel: LustreError: 5023:0:(obd_config.c:372:class_setup()) setup lustre-OST0028-osc-ffff88022dcc4000 failed (-2)
Jul 31 10:08:20 apathy kernel: LustreError: 5023:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
Jul 31 10:08:20 apathy kernel: Lustre: cmd=cf003 0:lustre-OST0028-osc 1:lustre-OST0028_UUID 2:<ibaddr>@o2ib
Jul 31 10:08:20 apathy kernel: LustreError: 15c-8: MGC<gbitaddr>@tcp: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(llite_lib.c:1095:ll_fill_super()) Unable to process log: -2
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(lov_obd.c:1009:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(lov_obd.c:1009:lov_cleanup()) Skipped 39 previous similar messages
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(mdc_request.c:1498:mdc_precleanup()) client import never connected
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(obd_config.c:443:class_cleanup()) Device 43 not setup
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Jul 31 10:08:20 apathy kernel: Lustre: client lustre-client(ffff88022dcc4000) umount complete
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount (-2)
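If it helps to see exactly what the MGS is handing out, I believe the client configuration log can be dumped from an ldiskfs-backed MGS with debugfs and inspected with llog_reader, roughly as follows (/dev/md0 is our combined MDS/MGS device; the exact path is from memory):

   # On the combined MDS/MGS, with the target unmounted:
   debugfs -c -R "dump /CONFIGS/lustre-client /tmp/lustre-client" /dev/md0
   # Print the records; the stale cf003 setup entry for OST0028 should
   # show only the <ibaddr>@o2ib NID:
   llog_reader /tmp/lustre-client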
I'm now suspicious that I need to unmount all the OSSes (for correctness), unmount the MDS, and run

   tunefs.lustre --writeconf /dev/md0

on it to clear the logs, and then remount.

Note we have a combined MDS/MGS.

James
On 7/31/13 10:37 AM, James Robnett wrote:

> I'm now suspicious that I need to unmount all the OSSes (for
> correctness), unmount the MDS and run
>
>    tunefs.lustre --writeconf /dev/md0
>
> on it to clear the logs and then remount.
>
> Note we have a combined MDS/MGS.

Yes. Since the configuration is held on the MDS, you need to do the
--writeconf, then remount the servers. The procedure should be in the
Lustre Manual.

cliffw
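Roughly, the manual's "regenerate configuration logs" steps look like the following; the mount points below are examples, and the devices are taken from James's earlier posts:

   # 1. Unmount clients, then all OSTs, then the MDT.
   umount /mnt/lustre          # on every client
   umount /mnt/ost*            # on each OSS
   umount /mnt/mdt             # on the MDS/MGS

   # 2. Regenerate the configuration logs on every target.
   tunefs.lustre --writeconf /dev/md0            # MDT, on the MDS/MGS
   tunefs.lustre --writeconf /dev/sd{b,c,d,e}    # OSTs, on each OSS

   # 3. Remount in order: MDT first, then OSTs, then clients.
   mount -t lustre /dev/md0 /mnt/mdt
   mount -t lustre /dev/sdb /mnt/ost0            # etc. for each OST
   mount -t lustre <mgsnid>:/lustre /mnt/lustre  # on clients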
Many thanks Cliff. One other question...

The manual is fairly insistent that the filesystem be unmounted on the clients as well as the OSTs, MDT, etc. The vast majority of our clients are Infiniband and working fine. Given the nature of the problem (i.e. the new OSS only knew about IB), do you think it's critical that we unmount the IB clients? The 1Gbit and 10Gbit clients are naturally unmounted due to the kernel panic.

My preference would be to simply unmount each OST on all the OSSes, unmount the MDT, run writeconf on the MDS/MGS, remount the MDT, and then mount the OSTs, as sketched below. I'd leave the IB-connected clients alone; they would restore connectivity after the MDS and OSSes came back up. The whole process would only take a few minutes, less than the recovery time.

Or do you think I'm just asking for trouble and should shut everything down? That's a painful process for the clients but doable.

James

ps: I assume I have to actually unmount the OSTs. I could believe, given this instance, it might be OK/safe to just unmount the MDS, run writeconf on it, and remount.

(Resending since I failed to reply to the list.)
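Concretely, the short-outage sequence I have in mind is the following; mount points are placeholders, and whether skipping the client unmounts is safe is exactly what I'm asking:

   # On each OSS: unmount all OSTs.
   umount /mnt/ost{0..3}
   # On the MDS/MGS: unmount the MDT, clear the config logs, remount.
   umount /mnt/mdt
   tunefs.lustre --writeconf /dev/md0
   mount -t lustre /dev/md0 /mnt/mdt
   # On each OSS: remount the OSTs (the manual's full procedure would
   # also run tunefs.lustre --writeconf /dev/sd{b,c,d,e} first).
   mount -t lustre /dev/sdb /mnt/ost0   # etc. for each OST
   # IB clients stay mounted throughout and reconnect on their own.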