Adrian Ulrich
2010-Nov-19 15:50 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Hi,
Our MDS refuses to start after we tried to enable quotas.
What we did:
# umount /lustre/mds
# tunefs.lustre --param mdt.quota_type=ug /dev/md10 (as described in
http://wiki.lustre.org/manual/LustreManual18_HTML/ConfiguringQuotas.html)
# sync
# mount -t lustre /dev/md10 /lustre/mds
---> at this point, the mds crashed <---
Now the MDS refuses to startup:
Lustre: OBD class driver, http://www.lustre.org/
Lustre: Lustre Version: 1.8.4
Lustre: Build Version:
1.8.4-20100726215630-PRISTINE-2.6.18-194.3.1.el5_lustre.1.8.4
Lustre: Listener bound to ib0:10.201.62.11:987:mlx4_0
Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1
Lustre: Added LNI 10.201.62.11@o2ib [8/64/0/180]
Lustre: Added LNI 10.201.30.11@tcp [8/256/0/180]
Lustre: Accept secure, port 988
Lustre: Lustre Client File System; http://www.lustre.org/
init dynlocks cache
ldiskfs created from ext3-2.6-rhel5
kjournald starting. Commit interval 5 seconds
LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
LDISKFS FS on md10, internal journal
LDISKFS-fs: recovery complete.
LDISKFS-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
LDISKFS FS on md10, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
Lustre: MGS MGS started
Lustre: MGC10.201.62.11@o2ib: Reactivating import
Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib,
specified as failover
LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not
available for connect (no target)
LustreError: 6440:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing
error (-19) req@ffff81021986a000 x1352839800570911/t0
o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181453 ref 1 fl
Interpret:/0/0 rc -19/0
LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not
available for connect (no target)
LustreError: Skipped 1 previous similar message
LustreError: 6441:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing
error (-19) req@ffff81021986ac00 x1352839303546603/t0
o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181453 ref 1 fl
Interpret:/0/0 rc -19/0
LustreError: 6441:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous
similar message
LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not
available for connect (no target)
LustreError: Skipped 17 previous similar messages
LustreError: 6459:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing
error (-19) req@ffff8101ee758400 x1352840769468288/t0
o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181454 ref 1 fl
Interpret:/0/0 rc -19/0
LustreError: 6459:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 17
previous similar messages
LustreError: 6423:0:(mgs_handler.c:671:mgs_handle()) MGS handle cmd=253 rc=-99
LustreError: 11-0: an error occurred while communicating with 0@lo. The
mgs_target_reg operation failed with -99
LustreError: 6177:0:(obd_mount.c:1097:server_start_targets()) Required
registration failed for lustre1-MDT0000: -99
LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not
available for connect (no target)
LustreError: Skipped 17 previous similar messages
LustreError: 6451:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing
error (-19) req@ffff8101ea921800 x1352839510145001/t0
o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181455 ref 1 fl
Interpret:/0/0 rc -19/0
LustreError: 6451:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 18
previous similar messages
LustreError: 6177:0:(obd_mount.c:1655:server_fill_super()) Unable to start
targets: -99
LustreError: 6177:0:(obd_mount.c:1438:server_put_super()) no obd lustre1-MDT0000
LustreError: 6177:0:(obd_mount.c:147:server_deregister_mount()) lustre1-MDT0000
not registered
Lustre: MGS has stopped.
LustreError: 137-5: UUID 'lustre1-MDT0000_UUID' is not
available for connect (no target)
LustreError: 6464:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing
error (-19) req@ffff8101ec658000 x1352839459803293/t0
o38-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1290181457 ref 1 fl
Interpret:/0/0 rc -19/0
LustreError: 6464:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 50
previous similar messages
LustreError: Skipped 58 previous similar messages
Lustre: server umount lustre1-MDT0000 complete
LustreError: 6177:0:(obd_mount.c:2050:lustre_fill_super()) Unable to
mount (-99)
Removing the quota params via
# tunefs.lustre --erase-params --param="failover.node=10.201.62.11@o2ib,10.201.30.11@tcp
failover.node=10.201.62.12@o2ib,10.201.30.12@tcp
mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10
did not help.
So what does 'Lustre: Denying initial registration attempt from nid
10.201.62.11@o2ib, specified as failover' mean, exactly?
This *is* 10.201.62.11 and tunefs shows:
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lustre1-MDT0000
Index: 0
Lustre FS: lustre1
Mount type: ldiskfs
Flags: 0x45
(MDT MGS update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: failover.node=10.201.62.11@o2ib,10.201.30.11@tcp
failover.node=10.201.62.12@o2ib,10.201.30.12@tcp
mdt.group_upcall=/usr/sbin/l_getgroups
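(For what it's worth, the same dump can presumably be reproduced without
touching the disk at all using tunefs.lustre's --dryrun switch -- a sketch,
assuming the 1.8 behaviour of only printing what would be done:

# tunefs.lustre --dryrun /dev/md10
)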
Regards,
Adrian
Adrian Ulrich
2010-Nov-20 07:17 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
> Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover

I've removed the node's own NID from the MDT and all OSTs: afterwards I was able to mount
everything, and Lustre has now been working (again) for 8 hours.

However, I have no idea:

- if this is correct (will failover still work?)
- why it worked before
- why it didn't work again after removing the quota option.

--
RFC 1925:
(11) Every old idea will be proposed again with a different name and
a different presentation, regardless of whether it works.
Kevin Van Maren
2010-Nov-21 17:02 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Okay, the local server is shown:

Lustre: Added LNI 10.201.62.11@o2ib [8/64/0/180]
Lustre: Added LNI 10.201.30.11@tcp [8/256/0/180]

But you specified that as a failover node:

# tunefs.lustre --erase-params --param="failover.node=10.201.62.11@o2ib,10.201.30.11@tcp failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10

As you surmised, you do not add the local NIDs, as those are added
implicitly by Lustre. Note that this means the first mount has to be
done on the "primary" server, not the failover server.

Not sure what you mean when you say it worked before -- did you specify
both sets on your mkfs command line?

Kevin

Adrian Ulrich wrote:
>> Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
>
> I've removed the node's own NID from the MDT and all OSTs: afterwards I was able to mount
> everything, and Lustre has now been working (again) for 8 hours.
>
> However, I have no idea:
>
> - if this is correct (will failover still work?)
> - why it worked before
> - why it didn't work again after removing the quota option.
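In other words, for this pair the MDT's stored failover parameter would
presumably only list the partner node -- roughly along these lines (a sketch,
not the exact command to run blindly; it assumes the group_upcall setting
should be kept and that the 10.201.62.11 node is the one doing the
mkfs/first mount):

# tunefs.lustre --erase-params --param="failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10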
Adrian Ulrich
2010-Nov-21 20:06 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Hi Kevin,

> But you specified that as a failover node:
> # tunefs.lustre --erase-params --param="failover.node=10.201.62.11@o2ib,10.201.30.11@tcp failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10

Well: first I was just running

# tunefs.lustre --param mdt.quota_type=ug /dev/md10

and this alone was enough to break it. Then I tried to remove the
quota option with --erase-params, and I included both nodes (the
primary + the failover) because 'tunefs.lustre /dev/md10' displayed them.

> Not sure what you mean when you say it worked before

It worked before we added the *.quota_type parameters: this installation
is over a year old and has seen quite a few remounts and an upgrade
from 1.8.1.1 -> 1.8.4.

> did you specify both sets on your mkfs command line?

The initial installation was done / dictated by the Swiss branch of a
(no longer existing) three-letter company. This command was used to
create the filesystem on the MDS:

# FS_NAME="lustre1"
# MGS_1="10.201.62.11@o2ib0,10.201.30.11@tcp0"
# MGS_2="10.201.62.12@o2ib0,10.201.30.12@tcp0"
# mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} --failnode=${MGS_2} /dev/md10

Regards and thanks,
 Adrian

--
RFC 1925:
(11) Every old idea will be proposed again with a different name and
a different presentation, regardless of whether it works.
Kevin Van Maren
2010-Nov-21 20:57 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Adrian Ulrich wrote:
> Hi Kevin,
>
>> But you specified that as a failover node:
>> # tunefs.lustre --erase-params --param="failover.node=10.201.62.11@o2ib,10.201.30.11@tcp failover.node=10.201.62.12@o2ib,10.201.30.12@tcp mdt.group_upcall=/usr/sbin/l_getgroups" /dev/md10
>
> Well: first I was just running
>
> # tunefs.lustre --param mdt.quota_type=ug /dev/md10
>
> and this alone was enough to break it.

Not sure.

>> did you specify both sets on your mkfs command line?
>
> The initial installation was done / dictated by the Swiss branch of a
> (no longer existing) three-letter company. This command was used to
> create the filesystem on the MDS:
>
> # FS_NAME="lustre1"
> # MGS_1="10.201.62.11@o2ib0,10.201.30.11@tcp0"
> # MGS_2="10.201.62.12@o2ib0,10.201.30.12@tcp0"
> # mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} --failnode=${MGS_2} /dev/md10

I haven't done a combined MDT/MGS for a while, so I can't recall whether
you have to specify the MGS NIDs for the MDT when it is colocated with
the MGS, but I think the command should have been more like:

# mkfs.lustre --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_2} --mgsnode=${MGS_1} --mgsnode=${MGS_2} /dev/md10

with the mkfs/first mount on MGS_1.

As I mentioned, you would not normally specify the mkfs/first-mount NIDs
as failover parameters, as they are added automatically by Lustre.

Kevin
Adrian Ulrich
2010-Nov-22 18:37 UTC
[Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62.11@o2ib, specified as failover
Hi Kevin,

>> # tunefs.lustre --param mdt.quota_type=ug /dev/md10
>> and this alone was enough to break it.
> Not sure.

It is: from my .bash_history:

# umount /dev/md10
# tunefs.lustre --param mdt.quota_type=ug /dev/md10
# mount -t lustre /dev/md10
--> broken <--

> As I mentioned, you would not normally specify the mkfs/first-mount NIDs
> as failover parameters, as they are added automatically by Lustre.

Ok, good to know. Currently everything seems to work fine (again):
failover still works on the MDS and all OSSes, but I still have no idea
why tunefs broke it and/or why it worked with both NIDs before.

++unsolved_mysteries;

Regards,
 Adrian

--
RFC 1925:
(11) Every old idea will be proposed again with a different name and
a different presentation, regardless of whether it works.