Hi,

we're using 1.6.4.2 here. 5 days ago we lost an OST block device forever. I
deactivated this OST, recreated the block device and ran "mkfs.lustre
--reformat ..." on the new device, giving it the same OST index (OST0010) as
before. Mounting this OST then failed with an "already in use" message, so I
reformatted the device again without passing an index number. But now, when I
try to mount the clients, I get the following in dmesg:

Lustre: 1837:0:(obd_mount.c:1685:lustre_check_exclusion()) Excluding chicfs-OST0010-osc (on exclusion list)
Lustre: setting import chicfs-OST0010_UUID INACTIVE by administrator request
Lustre: chicfs-OST0010-osc-00000100cfa0f800.osc: set parameter active=0
LustreError: 1837:0:(lov_obd.c:140:lov_connect_obd()) not connecting OSC chicfs-OST0010_UUID; administratively disabled
Lustre: Client chicfs-client has started
ib0: no IPv6 routers present
LustreError: 2596:0:(client.c:504:ptlrpc_import_delay_req()) @@@ Uninitialized import. req at 00000100cfb91600 x274/t0 o400->chicfs-OST0010_UUID@<NULL>:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 2596:0:(client.c:506:ptlrpc_import_delay_req()) LBUG
Lustre: 2596:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 2596
lfs    R  running task    0  2596  2594    (NOTLB)
00000100715f9ca8 0000000000000000 0000000000000000 00000100718d6740
00000100715f9e78 00000100715f9ca8 00000100715f9cb8 000001007e32c800
ffffffffa031e990 00000000000001fa
Call Trace:
<ffffffff80148b4b>{__kernel_text_address+26}
<ffffffff801115c0>{show_trace+375}
<ffffffff801116fc>{show_stack+241}
<ffffffffa01e39c3>{:libcfs:lbug_with_loc+115}
<ffffffffa02ebc4e>{:ptlrpc:ptlrpc_import_delay_req+238}
<ffffffffa02f05e8>{:ptlrpc:ptlrpc_queue_wait+584}
<ffffffffa02f9bb3>{:ptlrpc:lustre_pack_request+995}
<ffffffffa02eb9f8>{:ptlrpc:ptlrpc_prep_req_pool+1832}
<ffffffff80178f93>{file_move+27}
<ffffffff80177540>{dentry_open_it+284}
<ffffffffa031ceac>{:ptlrpc:lprocfs_wr_ping+444}
<ffffffff803211ef>{__down_read+52}
<ffffffffa0266a15>{:obdclass:lprocfs_fops_write+117}
<ffffffff8017821a>{vfs_write+207}
<ffffffff80178302>{sys_write+69}
<ffffffff8011022a>{system_call+126}

Any hints are appreciated. Is there a way to fully remove OST0010 from the
configuration?

Thanks,
Frank

--
Dipl.-Inf. Frank Mietke  | Fakultätsrechen- und Informationszentrum
Tel.: 0371 - 531 - 35538 | Fak. für Informatik
Fax:  0371 - 531 8 35538 | TU-Chemnitz
Key-ID: 60F59599         | frank.mietke at informatik.tu-chemnitz.de
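For reference, a minimal sketch of the re-creation step described above, with
placeholder device path and MGS NID (neither is taken from the original
report); index 16 is the decimal form of OST0010:

    # reformat the replacement device and request the old OST index
    mkfs.lustre --reformat --ost --fsname=chicfs --index=16 \
        --mgsnode=<mgs-nid> /dev/<ost-device>
    # start the OST by mounting it as type lustre
    mount -t lustre /dev/<ost-device> /mnt/chicfs-ost0010

This is the step that produced the "already in use" error in the report,
since the MGS still holds the registration for the old OST0010.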
Hi,

an update on the problem described below. The command "lfs check servers",
which accesses the /proc filesystem entry of OST0010, seems to be what
triggers the trace. If I run "cat /proc/.../chicfs-OST0010.../ost_conn_uuid",
that process hangs. On the MDS the same command works without problems.

Frank

On Wed, Apr 30, 2008 at 11:03:11AM +0200, Frank Mietke wrote:
> Hi,
>
> we're using 1.6.4.2 here. 5 days ago we lost an OST block device forever.
> [...]
> Any hints are appreciated. Is there a way to fully remove OST0010 from the
> configuration?
>
> Thanks,
> Frank

--
Dipl.-Inf. Frank Mietke  | Fakultätsrechen- und Informationszentrum
Tel.: 0371 - 531 - 35538 | Fak. für Informatik
Fax:  0371 - 531 8 35538 | TU-Chemnitz
Key-ID: 60F59599         | frank.mietke at informatik.tu-chemnitz.de
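For reference, on a 1.6 client the state of the excluded OSC can usually be
read with lctl instead of the per-OSC /proc entries that hang; a minimal
sketch (device name taken from the messages above, behaviour not verified in
this thread):

    # list configured obd devices and their status; this only reads the
    # device list and does not ping the dead OST
    lctl dl | grep OST0010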
From what I have been able to gather, this is not possible at the moment.

The dead OST will always be there. There is no functioning way to actually
remove it at the moment.

https://bugzilla.lustre.org/show_bug.cgi?id=15345

We are running in failout mode rather than failover. When our new clients
tried to reconnect to our test cluster, they appeared to block forever. I
would need to recreate the situation to confirm that this is the behavior,
but I am dealing with another issue at the moment.

If you are on a production cluster, you may be in a bad way. The only way I
have found to recover from this is to wipe the cluster and start fresh. (Not
a good option.)

--
Andrew

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Frank Mietke
> Sent: Wednesday, April 30, 2008 3:03 AM
> To: lustre-discuss at lists.lustre.org
> Subject: [Lustre-discuss] Lustre mount problem
>
> Hi,
>
> we're using 1.6.4.2 here. 5 days ago we lost an OST block device forever.
> [...]
> Any hints are appreciated. Is there a way to fully remove OST0010 from the
> configuration?
>
> Thanks,
> Frank
Hi Andrew,

On Wed, Apr 30, 2008 at 08:58:42AM -0600, Lundgren, Andrew wrote:
> From what I have been able to gather, this is not possible at the moment.
>
> The dead OST will always be there. There is no functioning way to actually
> remove it at the moment.
>
> https://bugzilla.lustre.org/show_bug.cgi?id=15345

thank you for pointing me to this bug report.

> We are running in failout mode rather than failover. When our new clients
> tried to reconnect to our test cluster, they appeared to block forever. I
> would need to recreate the situation to confirm that this is the behavior,
> but I am dealing with another issue at the moment.
>
> If you are on a production cluster, you may be in a bad way. The only way I
> have found to recover from this is to wipe the cluster and start fresh.
> (Not a good option.)

I could live with a "dead" OST in the configuration, but as I wrote in the
update, every access to a /proc entry of this OST on the clients hangs
forever. Not really optimal.

Thanks,
Frank

--
Dipl.-Inf. Frank Mietke  | Fakultätsrechen- und Informationszentrum
Tel.: 0371 - 531 - 35538 | Fak. für Informatik
Fax:  0371 - 531 8 35538 | TU-Chemnitz
Key-ID: 60F59599         | frank.mietke at informatik.tu-chemnitz.de
> I could live with a "dead" OST in the configuration, but as I wrote in the
> update, every access to a /proc entry of this OST on the clients hangs
> forever. Not really optimal.

Are you running in failover or failout?

--
Andrew
Hello Frank,

On Wednesday 30 April 2008 17:47:55 Frank Mietke wrote:
> I could live with a "dead" OST in the configuration, but as I wrote in the
> update, every access to a /proc entry of this OST on the clients hangs
> forever. Not really optimal.

your very first approach, setting the re-created OST to the old index, was
actually the right way to go. In your present situation I would create
another very, very small OST and set it to the old index number.

Lustre will again refuse to re-register this OST, but there is a way to
convince it not to complain. Here is what I already wrote to the list when I
ran into the same problem as you:

<quote of myself from 2008-02-05 21:45 "Re: [Lustre-discuss] how to recreate an OST?">
> Now I mounted the mgs as ldiskfs, and in CONFIGS/ there is no file for the
> missing ost.
> But now I just found the reason - the failed OST was still activated on the
> clients. After deleting CONFIGS/{fsname}-client and remounting as type
> lustre again, registering the failed ost works!
> I guess one shouldn't do it this way if one still has important data on the
> filesystem ;)
</quote>

I never got a reply from the Sun/ClusterFS developers on whether this is the
right approach, but I have tested it several times and it seems to be the
right way to go.

Hope it helps,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
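For reference, a minimal sketch of the procedure Bernd describes, with
placeholder device paths and MGS NID (index 16 is the decimal form of
OST0010); as Bernd notes, this is unverified by the developers, so try it on
a test setup first:

    # 1. create a small replacement OST carrying the old index
    mkfs.lustre --reformat --ost --fsname=chicfs --index=16 \
        --mgsnode=<mgs-nid> /dev/<small-device>

    # 2. on the MGS node, with the MGS stopped, mount its device as ldiskfs
    #    and remove the client configuration log
    mount -t ldiskfs /dev/<mgs-device> /mnt/mgs
    rm /mnt/mgs/CONFIGS/chicfs-client
    umount /mnt/mgs

    # 3. remount the MGS as type lustre, then mount the new OST so it can
    #    re-register under the old index
    mount -t lustre /dev/<mgs-device> /mnt/mgs
    mount -t lustre /dev/<small-device> /mnt/chicfs-ost0010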
Hi,

On Wed, Apr 30, 2008 at 09:59:14AM -0600, Lundgren, Andrew wrote:
> Are you running in failover or failout?

there was no failover node configured for this OST. What's the default,
failout?

Frank

--
Dipl.-Inf. Frank Mietke  | Fakultätsrechen- und Informationszentrum
Tel.: 0371 - 531 - 35538 | Fak. für Informatik
Fax:  0371 - 531 8 35538 | TU-Chemnitz
Key-ID: 60F59599         | frank.mietke at informatik.tu-chemnitz.de
Hi Bernd,

> your very first approach, setting the re-created OST to the old index, was
> actually the right way to go. In your present situation I would create
> another very, very small OST and set it to the old index number.
>
> Lustre will again refuse to re-register this OST, but there is a way to
> convince it not to complain. Here is what I already wrote to the list when
> I ran into the same problem as you:
> [...]
> I never got a reply from the Sun/ClusterFS developers on whether this is
> the right approach, but I have tested it several times and it seems to be
> the right way to go.

thank you. I had read that article in the mailing list archive, but as you
mentioned, there was no reply from Sun/ClusterFS. I am now setting up a
Lustre test system and will play around with this a bit before doing it on
our production system. ;)

Thanks,
Frank

--
Dipl.-Inf. Frank Mietke  | Fakultätsrechen- und Informationszentrum
Tel.: 0371 - 531 - 35538 | Fak. für Informatik
Fax:  0371 - 531 8 35538 | TU-Chemnitz
Key-ID: 60F59599         | frank.mietke at informatik.tu-chemnitz.de