Adrian Ulrich
2010-Nov-19 15:50 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Hi,

Our MDS refuses to start after we tried to enable quotas.

What we did:

 # umount /lustre/mds
 # tunefs.lustre --param mdt.quota_type=ug /dev/md10
 # sync
 # mount -t lustre /dev/md10 /lustre/mds

(as described in http://wiki.lustre.org/manual/LustreManual18_HTML/ConfiguringQuotas.html)

---> at this point, the MDS crashed <---

Now the MDS refuses to start up:

 Lustre: OBD class driver, http://www.lustre.org/
 Lustre: Lustre Version: 1.8.4
 Lustre: Build Version: 1.8.4-20100726215630-PRISTINE-2.6.18-194.3.1.el5_lustre.1.8.4
 Lustre: Listener bound to ib0:10.201.62.11:987:mlx4_0
 Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1
 Lustre: Added LNI 10.201.62.11@o2ib [8/64/0/180]
 Lustre: Added LNI 10.201.30.11@tcp [8/256/0/180]
 Lustre: Accept secure, port 988
 Lustre: Lustre Client File System; http://www.lustre.org/
 init dynlocks cache
 ldiskfs created from ext3-2.6-rhel5
 kjournald starting. Commit interval 5 seconds
 LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
 LDISKFS FS on md10, internal journal
 LDISKFS-fs: recovery complete.
 LDISKFS-fs: mounted filesystem with ordered data mode.
 kjournald starting. Commit interval 5 seconds
 LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
 LDISKFS FS on md10, internal journal
 LDISKFS-fs: mounted filesystem with ordered data mode.
 Lustre: MGS MGS started
 Lustre: MGC10.201.62.11@o2ib: Reactivating import
 Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
 LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not available for connect (no target)
 LustreError: 6440:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff81021986a000 x1352839800570911/t0 o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181453 ref 1 fl Interpret:/0/0 rc -19/0
 LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not available for connect (no target)
 LustreError: Skipped 1 previous similar message
 LustreError: 6441:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff81021986ac00 x1352839303546603/t0 o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181453 ref 1 fl Interpret:/0/0 rc -19/0
 LustreError: 6441:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
 LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not available for connect (no target)
 LustreError: Skipped 17 previous similar messages
 LustreError: 6459:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8101ee758400 x1352840769468288/t0 o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181454 ref 1 fl Interpret:/0/0 rc -19/0
 LustreError: 6459:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 17 previous similar messages
 LustreError: 6423:0:(mgs_handler.c:671:mgs_handle()) MGS handle cmd=253 rc=-99
 LustreError: 11-0: an error occurred while communicating with 0@lo. The mgs_target_reg operation failed with -99
 LustreError: 6177:0:(obd_mount.c:1097:server_start_targets()) Required registration failed for lustre1-MDT0000: -99
 LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not available for connect (no target)
 LustreError: Skipped 17 previous similar messages
 LustreError: 6451:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8101ea921800 x1352839510145001/t0 o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181455 ref 1 fl Interpret:/0/0 rc -19/0
 LustreError: 6451:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 18 previous similar messages
 LustreError: 6177:0:(obd_mount.c:1655:server_fill_super()) Unable to start targets: -99
 LustreError: 6177:0:(obd_mount.c:1438:server_put_super()) no obd lustre1-MDT0000
 LustreError: 6177:0:(obd_mount.c:147:server_deregister_mount()) lustre1-MDT0000 not registered
 Lustre: MGS has stopped.
 LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not available for connect (no target)
 LustreError: 6464:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff8101ec658000 x1352839459803293/t0 o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181457 ref 1 fl Interpret:/0/0 rc -19/0
 LustreError: 6464:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 50 previous similar messages
 LustreError: Skipped 58 previous similar messages
 Lustre: server umount lustre1-MDT0000 complete
 LustreError: 6177:0:(obd_mount.c:2050:lustre_fill_super()) Unable to mount (-99)

Removing the quota params via

 # tunefs.lustre --erase-params --param="failover.node=10.201.62.11@o2ib,10.201.30.11@tcp failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10

did not help.

So what does 'Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover' actually mean? This *is* 10.201.62.11, and tunefs shows:

 checking for existing Lustre data: found CONFIGS/mountdata
 Reading CONFIGS/mountdata

 Read previous values:
 Target:                lustre1-MDT0000
 Index:                 0
 Lustre FS:             lustre1
 Mount type:            ldiskfs
 Flags:                 0x45 (MDT MGS update )
 Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
 Parameters: failover.node=10.201.62.11@o2ib,10.201.30.11@tcp failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups

Regards,
 Adrian
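[Editor's sketch: before changing on-disk parameters it helps to capture what is already stored, so a bad change can be compared against a known state. A minimal sketch, assuming a tunefs.lustre that supports --dryrun (which prints the stored configuration without modifying the device); the backup file path is just an example:

 # tunefs.lustre --dryrun /dev/md10                          (inspect, no changes)
 # tunefs.lustre --dryrun /dev/md10 > /root/md10-params.txt  (keep a copy)
 # umount /lustre/mds
 # tunefs.lustre --param mdt.quota_type=ug /dev/md10         (the change from this thread)

The saved output makes it possible to see exactly which failover.node and other parameters were present before the quota change.]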
Adrian Ulrich
2010-Nov-20 07:17 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
> Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover

I've removed our own NID from the MDT and all OSTs: afterwards I was able to mount everything, and Lustre has now been working (again) for 8 hours.

However, I have no idea:

 - if this is correct (will failover still work?)
 - why it worked before
 - why it still didn't work after removing the quota option.

--
RFC 1925:
 (11) Every old idea will be proposed again with a different name and
      a different presentation, regardless of whether it works.
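[Editor's sketch: the change described here — rewriting the stored parameters so that only the partner node is listed as failover — would look roughly as follows, reusing the NIDs and group_upcall value shown earlier in the thread. The exact parameter string for each OST is an assumption and should be taken from that target's own tunefs.lustre --dryrun output:

 (on the primary MDS, 10.201.62.11, with the target unmounted)
 # tunefs.lustre --erase-params --param="failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10

 (analogously on each OSS, keeping only the partner OSS's NIDs in failover.node)]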
Kevin Van Maren
2010-Nov-21 17:02 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Okay, the local server is shown:

 Lustre: Added LNI 10.201.62.11@o2ib [8/64/0/180]
 Lustre: Added LNI 10.201.30.11@tcp [8/256/0/180]

But you specified that as a failover node:

 # tunefs.lustre --erase-params --param="failover.node=10.201.62.11@o2ib,10.201.30.11@tcp failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10

As you surmised, you do not add the local NIDs, as those are added implicitly by Lustre. Note that this means the first mount has to be done on the "primary" server, not the failover server.

Not sure what you mean when you say it worked before -- did you specify both sets on your mkfs command line?

Kevin

Adrian Ulrich wrote:
>> Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
>
> I've removed our own NID from the MDT and all OSTs: afterwards I was able to mount
> everything, and Lustre has now been working (again) for 8 hours.
>
> However, I have no idea:
>
> - if this is correct (will failover still work?)
> - why it worked before
> - why it still didn't work after removing the quota option.
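[Editor's sketch: the "added implicitly" point can be checked on a live server. lctl can print the NIDs the local node has brought up, and those are exactly the ones that must not appear in that target's own failover.node list. The output shown is what the thread's MDS would be expected to print, not a captured transcript:

 # lctl list_nids
 10.201.62.11@o2ib
 10.201.30.11@tcp

Any failover.node value stored on /dev/md10 should therefore name only the partner server, 10.201.62.12.]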
Adrian Ulrich
2010-Nov-21 20:06 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Hi Kevin,

> But you specified that as a failover node:
> # tunefs.lustre --erase-params --param="failover.node=10.201.62.11@o2ib,10.201.30.11@tcp failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10

Well: at first I was just running

 # tunefs.lustre --param mdt.quota_type=ug /dev/md10

and this alone was enough to break it. Then I tried to remove the quota option with --erase-params, and I included both nodes (the primary + failover) because 'tunefs.lustre /dev/md10' displayed them.

> Not sure what you mean when you say it worked before

It worked before we added the *.quota_type parameters: this installation is over a year old and has seen quite a few remounts and an upgrade from 1.8.1.1 -> 1.8.4.

> did you specify both sets on your mkfs command line?

The initial installation was done / dictated by the Swiss branch of a (no longer existing) three-letter company. This command was used to create the filesystem on the MDS:

 # FS_NAME="lustre1"
 # MGS_1="10.201.62.11@o2ib0,10.201.30.11@tcp0"
 # MGS_2="10.201.62.12@o2ib0,10.201.30.12@tcp0"
 # mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} --failnode=${MGS_2} /dev/md10

Regards and thanks,
 Adrian

--
RFC 1925:
 (11) Every old idea will be proposed again with a different name and
      a different presentation, regardless of whether it works.
Kevin Van Maren
2010-Nov-21 20:57 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Adrian Ulrich wrote:
> Hi Kevin,
>
>> But you specified that as a failover node:
>> # tunefs.lustre --erase-params --param="failover.node=10.201.62.11@o2ib,10.201.30.11@tcp failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10
>
> Well: at first I was just running
>
> # tunefs.lustre --param mdt.quota_type=ug /dev/md10
>
> and this alone was enough to break it.

Not sure.

>> did you specify both sets on your mkfs command line?
>
> The initial installation was done / dictated by the Swiss branch of
> a (no longer existing) three-letter company. This command was used
> to create the filesystem on the MDS:
>
> # FS_NAME="lustre1"
> # MGS_1="10.201.62.11@o2ib0,10.201.30.11@tcp0"
> # MGS_2="10.201.62.12@o2ib0,10.201.30.12@tcp0"
> # mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} --failnode=${MGS_2} /dev/md10

I haven't done a combined MDT/MGS for a while, so I can't recall whether you have to specify the MGS NIDs for the MDT when it is colocated with the MGS, but I think the command should have been more like:

 # mkfs.lustre --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_2} --mgsnode=${MGS_1} --mgsnode=${MGS_2} /dev/md10

with the mkfs/first mount on MGS_1.

As I mentioned, you would not normally specify the mkfs/first-mount NIDs as failover parameters, as they are added automatically by Lustre.

Kevin
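[Editor's sketch: to make the division of roles concrete, this is how the whole filesystem might be laid out following Kevin's suggestion. Everything reuses names from the thread except the OST device /dev/sdb and the partner-OSS NIDs, which are hypothetical placeholders:

 (on the primary MDS/MGS, 10.201.62.11 -- format and do the first mount here)
 # FS_NAME="lustre1"
 # MGS_1="10.201.62.11@o2ib0,10.201.30.11@tcp0"
 # MGS_2="10.201.62.12@o2ib0,10.201.30.12@tcp0"
 # mkfs.lustre --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_2} --mgsnode=${MGS_1} --mgsnode=${MGS_2} /dev/md10
 # mount -t lustre /dev/md10 /lustre/mds

 (on each primary OSS; list both possible MGS locations, and only the
  partner OSS as failnode -- /dev/sdb and the partner NIDs are placeholders)
 # mkfs.lustre --fsname ${FS_NAME} --ost --mgsnode=${MGS_1} --mgsnode=${MGS_2} --failnode=<partner-OSS-NIDs> /dev/sdb

The key point, per Kevin's note, is that --failnode names only the other server; the node where mkfs and the first mount happen contributes its own NIDs automatically.]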
Adrian Ulrich
2010-Nov-22 18:37 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Hi Kevin,

>> # tunefs.lustre --param mdt.quota_type=ug /dev/md10
>>
>> and this alone was enough to break it.
>
> Not sure.

It is: from my .bash_history:

 # umount /dev/md10
 # tunefs.lustre --param mdt.quota_type=ug /dev/md10
 # mount -t lustre /dev/md10
 --> broken <--

> As I mentioned, you would not normally specify the mkfs/first-mount NIDs
> as failover parameters, as they are added automatically by Lustre.

Ok, good to know. Currently everything seems to work fine (again): failover still works on the MDS and all OSSes, but I still have no idea why tunefs broke it and/or why it worked with both NIDs before.

++unsolved_mysteries;

Regards,
 Adrian

--
RFC 1925:
 (11) Every old idea will be proposed again with a different name and
      a different presentation, regardless of whether it works.
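[Editor's sketch: for anyone hitting the same message, one way to audit a cluster for this condition is to compare each server's own NIDs against the failover parameters stored on its targets; the local NIDs should not appear there. A sketch using commands seen in the thread (/dev/md10 is the MDS case; substitute each OSS's OST devices):

 # lctl list_nids                                        (NIDs of this server)
 # tunefs.lustre --dryrun /dev/md10 | grep -i failover   (stored failover params)

If a NID from the first command shows up in the second, that target may be refused registration with the "Denying initial registration attempt ... specified as failover" error, exactly as seen in this thread.]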