After a network event (switches bouncing), it looks like our MDS got borked somewhere from all the random failovers (the switches came up and down rapidly over a few hours). Now we cannot mount the MDS; when we try, we get the following errors:

Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@000001037c9db600 x85226/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: 7456:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@00000104163a6000 x47117/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: Skipped 11 previous similar messages
Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@0000010350a4d200 x81788/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 11 previous similar messages
Aug 19 12:37:40 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:40 mds2 kernel: LustreError: Skipped 18 previous similar messages
Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@0000010414dc1850 x81855/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699960 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 18 previous similar messages
Aug 19 12:37:42 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:42 mds2 kernel: LustreError: Skipped 42 previous similar messages
Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@000001037c9db600 x77144/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699962 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 42 previous similar messages
Aug 19 12:37:43 mds2 kernel: Lustre: Request x3 sent from MGC10.164.3.246@tcp to NID 10.164.3.246@tcp 5s ago has timed out (limit 5s).
Aug 19 12:37:43 mds2 kernel: Lustre: Changing connection for MGC10.164.3.246@tcp to MGC10.164.3.246@tcp_1/0@lo
Aug 19 12:37:43 mds2 kernel: Lustre: Enabling user_xattr
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service nobackup-MDT0000, 439 recoverable clients, last_transno 3647966566
Aug 19 12:37:43 mds2 kernel: Lustre: MDT nobackup-MDT0000 now serving dev (nobackup-MDT0000/57dddb69-2475-b551-4100-e045f91ce38c), but will be in recovery for at least 5:00, or until 439 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/nobackup-MDT0000/recovery_status.
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) nobackup-MDT0000: group upcall set to /usr/sbin/l_getgroups
Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_lov.c:1070:mds_notify()) MDS nobackup-MDT0000: in recovery, not resetting orphans on nobackup-OST0000_UUID
Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000: temporarily refusing client connection from 10.164.1.104@tcp
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3675:osc_llog_init()) osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' cnt 1 catid 00000101e1d979e8 rc=-2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3677:osc_llog_init()) logid 0xf150002:0x9642a0ac
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_log.c:230:lov_llog_init()) error osc_llog_init idx 0 osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' (rc=-2)
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(mds_log.c:220:mds_llog_init()) lov_llog_init err -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:417:llog_cat_initialize()) rc: -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_obd.c:727:lov_add_target()) add failed (-2), deleting nobackup-OST0000_UUID
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(obd_config.c:1093:class_config_llog_handler()) Err -2 on cfg command:
Aug 19 12:37:43 mds2 kernel: Lustre: cmd=cf00d 0:nobackup-mdtlov 1:nobackup-OST0000_UUID 2:0 3:1
Aug 19 12:37:43 mds2 kernel: LustreError: 15c-8: MGC10.164.3.246@tcp: The configuration from log 'nobackup-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Aug 19 12:37:43 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1113:server_start_targets()) failed to start server nobackup-MDT0000: -2
Aug 19 12:37:44 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1623:server_fill_super()) Unable to start targets: -2
Aug 19 12:37:44 mds2 kernel: Lustre: Failing over nobackup-MDT0000
Aug 19 12:37:44 mds2 kernel: Lustre: *** setting obd nobackup-MDT0000 device 'unknown-block(8,16)' read-only ***

We have run e2fsck on the volume; it found a few errors, which we corrected, but the problem persists. We also tried mounting with -o abort_recov, but that resulted in an assertion (LBUG) and does not work either.

Any thoughts?
These lines in particular caught my attention:

Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010

Thanks, we are running 1.6.6.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734)936-1985
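For readers following the thread, the two attempts mentioned above (the filesystem check and the abort_recov mount) are run against the unmounted MDT device. A rough sketch, with /dev/sdb and /mnt/mdt as placeholder names rather than the actual device used here:

    # check and repair the backing ldiskfs filesystem
    # (the Lustre-patched e2fsprogs is recommended for MDT/OST devices)
    e2fsck -fy /dev/sdb
    # try to bring the MDT up while skipping client recovery
    mount -t lustre -o abort_recov /dev/sdb /mnt/mdt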
On Aug 19, 2009 12:57 -0400, Brock Palen wrote:
> After a network event (switches bouncing), it looks like our MDS got
> borked somewhere from all the random failovers (the switches came up
> and down rapidly over a few hours).
>
> Now we cannot mount the MDS; when we try, we get the following errors:
>
> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2

Looks like a problem initializing the orphan object cleanup log. This shouldn't be fatal, in that a new log file could be created, but it dutifully reports the error up the (long) chain and fails the mount.

> We have run e2fsck on the volume; it found a few errors, which we
> corrected, but the problem persists. We also tried mounting with
> -o abort_recov, but that resulted in an assertion (LBUG) and does not
> work either.

That shouldn't happen either...

> Any thoughts?

You can delete the "CATALOGS" file on the MDS; it should start up OK after that.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
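A minimal sketch of that suggestion, with placeholder device and mount-point names; the MDT must not be mounted as Lustre while the file is removed:

    # mount the MDT device as plain ldiskfs
    mount -t ldiskfs /dev/sdb /mnt/mdt-ldiskfs
    # remove the llog catalog file so it is recreated on the next Lustre start
    rm /mnt/mdt-ldiskfs/CATALOGS
    umount /mnt/mdt-ldiskfs
    # remount the MDT as Lustre
    mount -t lustre /dev/sdb /mnt/mdt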
Some additional details:

I mounted the MDS as ldiskfs and deleted the files in OBJECTS/* and CATALOGS, then remounted as Lustre; same issue. I also did a writeconf and restarted all the servers, and saw messages on the MGS that new config logs were being created, but still the same error on the MDS trying to start up.

Is there a way to get Lustre to stop trying to open 0xf150010:80d24629, and not go through recovery?

If not, can I format a new MDS, and just untar ROOTS/ and apply the extended attributes to ROOTS from the old MDS filesystem?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734)936-1985

On Aug 19, 2009, at 12:57 PM, Brock Palen wrote:

> After a network event (switches bouncing), it looks like our MDS got
> borked somewhere from all the random failovers (the switches came up
> and down rapidly over a few hours).
>
> Now we cannot mount the MDS; when we try, we get the following errors:
>
> [...]
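The writeconf mentioned above is normally run with the whole filesystem stopped; a rough sketch under that assumption, with placeholder device names rather than the ones actually used here:

    # on each server, with all Lustre targets unmounted
    tunefs.lustre --writeconf /dev/sdb    # MDT, on the MDS/MGS node
    tunefs.lustre --writeconf /dev/sdc    # each OST, on its OSS
    # remount the MGS/MDT first so the config logs are regenerated,
    # then remount the OSTs
    mount -t lustre /dev/sdb /mnt/mdt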
On Aug 20, 2009 09:09 -0400, Brock Palen wrote:
> Some additional details:
>
> I mounted the MDS as ldiskfs and deleted the files in OBJECTS/* and
> CATALOGS, then remounted as Lustre; same issue. I also did a writeconf
> and restarted all the servers, and saw messages on the MGS that new
> config logs were being created, but still the same error on the MDS
> trying to start up.
>
> Is there a way to get Lustre to stop trying to open
> 0xf150010:80d24629, and not go through recovery?

With the exception of the CATALOGS file, I'm not sure what else would be looking for an llog like this. You could try touching that filename in OBJECTS, but I'm not sure that would help.

> If not, can I format a new MDS, and just untar ROOTS/ and apply
> the extended attributes to ROOTS from the old MDS filesystem?

Well, it depends on where the request is coming from. Deleting the CATALOGS file should be essentially the same as (and much faster than) what you propose above.

> On Aug 19, 2009, at 12:57 PM, Brock Palen wrote:
> > After a network event (switches bouncing), it looks like our MDS got
> > borked somewhere from all the random failovers (the switches came up
> > and down rapidly over a few hours).
> >
> > [...]
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
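On the question of rebuilding the MDS from a file-level copy: the commonly documented approach is a tar of the whole MDT backing filesystem plus a getfattr/setfattr pass, so that the striping extended attributes survive the restore. A rough sketch with placeholder mount points, assuming both the old and the new MDT are mounted as ldiskfs:

    # on the old MDT
    cd /mnt/old-mdt
    getfattr -R -d -m '.*' -e hex -P . > /tmp/mdt_ea.bak
    tar czf /tmp/mdt_backup.tgz --sparse .
    # on the freshly formatted MDT
    cd /mnt/new-mdt
    tar xzpf /tmp/mdt_backup.tgz
    setfattr --restore=/tmp/mdt_ea.bak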
Andreas, sorry I missed your reply yesterday. Here is how we fixed it:

We deleted OBJECTS/* and CATALOGS and shut down all the OSTs. At that point the MDS mounted correctly with -o abort_recov. We then remounted the OSTs (with recovery) and everything worked well. I have since (re)enabled quotas and heartbeat and bounced the servers a few times; all appears well.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734)936-1985

On Aug 20, 2009, at 9:09 AM, Brock Palen wrote:

> Some additional details:
>
> I mounted the MDS as ldiskfs and deleted the files in OBJECTS/* and
> CATALOGS, then remounted as Lustre; same issue. I also did a writeconf
> and restarted all the servers, and saw messages on the MGS that new
> config logs were being created, but still the same error on the MDS
> trying to start up.
>
> [...]
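For anyone who hits the same failure, the sequence described above amounts to roughly the following; device and mount-point names are placeholders, not the actual ones used here:

    # with the whole filesystem stopped, clear the llogs on the MDT
    mount -t ldiskfs /dev/sdb /mnt/mdt-ldiskfs
    rm /mnt/mdt-ldiskfs/CATALOGS
    rm /mnt/mdt-ldiskfs/OBJECTS/*
    umount /mnt/mdt-ldiskfs
    # with every OST still stopped, mount the MDT skipping client recovery
    mount -t lustre -o abort_recov /dev/sdb /mnt/mdt
    # then remount the OSTs normally and let the clients reconnect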