Jeremy Mann wrote:
> Nathaniel Rutman wrote:
>
>
>> You have to deactivate the OSCs that reference that OST. From
>> https://mail.clusterfs.com/wikis/lustre/MountConf:
>>
>> As of beta7, an OST can be permanently removed from a filesystem. Note
>> that any files that have stripes on the removed OST will henceforth
>> return EIO.
>>
>> mgs> lctl conf_param testfs-OST0001.osc.active=0
>>
>
> Thanks Nathaniel, I found it shortly after posting the message. However,
> maybe I didn't do it right, because I still get error messages about
> this node.
> The steps I took were:
>
> 1. umounted /lustre from the frontend
> 2. umounted /lustre-storage on node17
> 3. on frontend, ran lctl conf_param bcffs-OST0002.osc.active=0
> 4. on frontend, remounted bcffs with:
>
> mount -o exclude=bcffs-OST0002 -t lustre bcf@tcp0:/bcffs /lustre
>
>
All you need is step 3 on the MGS. You don't have to remount clients or
use -o exclude, but doing so won't hurt anything.
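For reference, the minimal sequence would be something like the sketch below (device names taken from this thread; the `lctl dl` check is just one way to confirm the OSC state on a client, and the exact status strings may vary by release):

```
# On the MGS only -- no client remount or -o exclude needed:
mgs> lctl conf_param bcffs-OST0002.osc.active=0

# On a client, list configured devices and check that the OSC for
# OST0002 is no longer connecting:
client> lctl dl | grep OST0002
```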
> dmesg shows:
>
> Lustre: 31233:0:(quota_master.c:1105:mds_quota_recovery()) Not all osts
> are active, abort quota recovery
>
Apparently quotas don't work with deactivated OSTs. I don't know much
about this area, but I suspect that to get rid of these messages you'll
need to have everything active.
> Lustre: MDS bcffs-MDT0000: bcffs-OST000c_UUID now active, resetting orphans
>
This looks like you restarted the MDT.
> LustreError: 32542:0:(file.c:1012:ll_glimpse_size()) obd_enqueue returned
> rc -5, returning -EIO
> Lustre: client 000001017ac3fc00 umount complete
>
And here the client stopped.
> Lustre: 1257:0:(obd_mount.c:1675:lustre_check_exclusion()) Excluding
> bcffs-OST0002-osc (on exclusion list)
> Lustre: 1257:0:(recover.c:231:ptlrpc_set_import_active()) setting import
> bcffs-OST0002_UUID INACTIVE by administrator request
>
That's the -o exclude.
> Lustre: osc.: set active=0 to 0
> LustreError: 1257:0:(lov_obd.c:139:lov_connect_obd()) not connecting OSC
> bcffs-OST0002_UUID; administratively disabled
>
And that's the osc.active=0. You only need one or the other, but using
both won't break anything.
> Lustre: Client bcffs-client has started
>
And here the client restarted.
> Lustre: 2484:0:(quota_master.c:1105:mds_quota_recovery()) Not all osts are
> active, abort quota recovery
> Lustre: MDS bcffs-MDT0000: bcffs-OST000d_UUID now active, resetting orphans
> Lustre: MGS: haven't heard from client
> 454dd520-82b9-e3e6-8fcb-800a75807121 (at 192.168.1.237@tcp) in 228
> seconds. I think it's dead, and I am evicting it.
>
The MGS eviction of an MGC is a non-destructive event, and I should turn
off this scary message. The MGC will re-acquire an MGS lock later.
This happened here because (I theorize) you had a combined MGS/MDT that
you restarted while other Lustre devices were still mounted on the same
node. Restarting the MGS means that all live MGCs must get kicked out
and reconnect.
> LustreError: 4834:0:(file.c:1012:ll_glimpse_size()) obd_enqueue returned
> rc -5, returning -EIO
>
This is the only potentially worrying error message, but depending on
who generated it, it might be fine.
> Are these normal error messages? I'm asking because I'm about to copy
> all of the NCBI databases to the lustre filesystem. I don't want to
> start it, then have Lustre crash and have to rebuild everything all
> over again minus this node.
>
>
You can use the "writeconf" procedure described on the wiki to remove
all traces of the removed OST. This does not require reformatting
anything and will likely fix the quota message.
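A rough sketch of that procedure, assuming the standard tunefs.lustre --writeconf flow (the device paths here are placeholders; follow the wiki page for the authoritative, version-specific steps):

```
# 1. Unmount all clients, then all OSTs, then the MDT/MGS.

# 2. Regenerate the configuration logs on each remaining server
#    (example device paths -- substitute your own):
mdt> tunefs.lustre --writeconf /dev/mdtdev
ost> tunefs.lustre --writeconf /dev/ostdev   # on each OST you are keeping

# 3. Remount the servers in order (MGS/MDT first, then the remaining
#    OSTs), then remount the clients. The removed OST is simply never
#    mounted again, so it disappears from the regenerated config.
```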