We added a new OSS to our 1.8.4 Lustre installation. It has 6 OSTs of 8.9TB each. Within a day of having these on-line, one OST stopped accepting new files. I cannot get it to activate. The other 5 seem fine.

On the MDS, "lctl dl" shows it IN, but not UP, and files can be read from it:

  33 IN osc umt3-OST001d-osc umt3-mdtlov_UUID 5

However, I cannot get it to re-activate:

  lctl --device umt3-OST001d-osc activate

This returns no errors, but dmesg on the MDS shows this as a result:

  [603128.578862] Lustre: umt3-OST001d-osc: Connection restored to service umt3-OST001d using nid 10.10.2.23@tcp.
  [603128.578865] Lustre: Skipped 1 previous similar message
  [603128.579251] Lustre: MDS umt3-MDT0000: umt3-OST001d_UUID now active, resetting orphans
  [603128.579256] Lustre: Skipped 1 previous similar message
  [603128.579608] LustreError: 9655:0:(osc_create.c:589:osc_create()) umt3-OST001d-osc: oscc recovery failed: -22
  [603128.579616] LustreError: 9655:0:(lov_obd.c:1134:lov_clear_orphans()) error in orphan recovery on OST idx 29/34: rc = -22
  [603128.579623] LustreError: 9655:0:(mds_lov.c:1057:__mds_lov_synchronize()) umt3-OST001d_UUID failed at mds_lov_clear_orphans: -22
  [603128.579628] LustreError: 9655:0:(mds_lov.c:1066:__mds_lov_synchronize()) umt3-OST001d_UUID sync failed -22, deactivating

On the OSS itself, I see these related entries appear:

  Lustre: 4642:0:(ldlm_lib.c:572:target_handle_reconnect()) umt3-OST001d: umt3-mdtlov_UUID reconnecting
  Lustre: 4642:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 1 previous similar message
  Lustre: umt3-OST001d: received MDS connection from 10.10.1.49@tcp
  Lustre: Skipped 1 previous similar message
  LustreError: 4697:0:(filter.c:3172:filter_handle_precreate()) umt3-OST001d: ignoring bogus orphan destroy request: obdid 11309489156331498430 last_id 0

Can anyone tell me what must be done to recover this disk volume?

Thanks,
bob
On Friday, September 03, 2010, Bob Ball wrote:
> We added a new OSS to our 1.8.4 Lustre installation. It has 6 OSTs of
> 8.9TB each. Within a day of having these on-line, one OST stopped
> accepting new files. I cannot get it to activate. The other 5 seem fine.
>
> On the MDS, "lctl dl" shows it IN, but not UP, and files can be read from
> it: 33 IN osc umt3-OST001d-osc umt3-mdtlov_UUID 5
>
> However, I cannot get it to re-activate:
>   lctl --device umt3-OST001d-osc activate
[...]
> LustreError: 4697:0:(filter.c:3172:filter_handle_precreate())
> umt3-OST001d: ignoring bogus orphan destroy request: obdid
> 11309489156331498430 last_id 0
>
> Can anyone tell me what must be done to recover this disk volume?

Check out section 23.3.9 in the Lustre manual ("How to Fix a Bad LAST_ID on an OST").

It is on my TODO list to write a tool to automatically correct the "lov_objid" file, but as of now I don't have it yet. Somehow your lov_objid file has a completely wrong value for this OST.

Now, when you say "files can be read from it", are you sure there are already files on that OST? The error message says that the last_id is zero, so you should not have a single file on it. If that is also wrong, you will need to correct it as well. You can do that manually, or you can use a patched e2fsprogs version that will do it for you.

Patches are here:
https://bugzilla.lustre.org/show_bug.cgi?id=22734

Packages can be found on my home page:
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/e2fsprogs/

If you want to do it automatically, you will need to create an lfsck mdsdb file (the hdr file is sufficient, see the lfsck section in the manual), and then you will need to run e2fsck for that OST as if you wanted to create an OSTDB file. That will start pass6, and if you then run e2fsck *without* "-n", i.e. in correcting mode, it will correct the LAST_ID file to what it finds on disk. With "-v" it will also tell you the old and the new value, and then you will need to put that value, properly coded, into the MDS lov_objid file.

Be careful and create backups of the lov_objid and LAST_ID files.

Hope it helps,
Bernd

--
Bernd Schubert
DataDirect Networks
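[A rough sketch of the sequence described above, for reference. It assumes the targets can be unmounted; /dev/mdtdev and /dev/sdc are example device names, not the actual hardware in this thread.]

  # On the MDS, with the MDT unmounted: build the mdsdb (read-only run).
  # Only the hdr portion of the database is needed for the LAST_ID fix.
  e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mdtdev

  # Copy /tmp/mdsdb* to the OSS, then run e2fsck on the bad OST as if
  # building an ostdb, but WITHOUT -n so pass6 can correct LAST_ID on disk.
  # -v prints the old and new LAST_ID values for later use in lov_objid.
  e2fsck -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb /dev/sdc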
On Friday, September 03, 2010, Bernd Schubert wrote:
> If you want to do it automatically, you will need to create an lfsck mdsdb
> file (the hdr file is sufficient, see the lfsck section in the manual), and
> then you will need to run e2fsck for that OST as if you wanted to create an
> OSTDB file. [...] then you will need to put that value, properly coded, into
> the MDS lov_objid file.

An update on the lov_objid file: actually, if you rename or delete it (rename it, please, so that you have a backup), the MDS should be able to re-create it from the OST LAST_ID data. So if the troublesome OST has no data yet, it will be very easy; if it already has data, you will need to correct the LAST_ID on that OST first.

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
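[A minimal sketch of that rename-with-backup, assuming Lustre is stopped and the MDT can be mounted as ldiskfs; the device path and mount point below are examples.]

  # Lustre stopped; mount the MDT backing filesystem directly.
  mount -t ldiskfs /dev/mdtdev /mnt/mdt
  cp -a /mnt/mdt/lov_objid /root/lov_objid.backup   # keep a copy off the MDT as well
  mv /mnt/mdt/lov_objid /mnt/mdt/lov_objid.orig     # rename rather than delete
  umount /mnt/mdt
  # On the next MDS mount, lov_objid should be rebuilt from each OST's LAST_ID.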
Thank you, Bernd. "df" claims there is some 442MB of data on the volume, compared to neighbors with 285GB. That could well be a fragment of a single, unsuccessful transfer attempt. I can run "lfs find" on it, though, and see what comes back. I was having problems earlier and thought I got files back from that command, but other problems on our cluster confused that result. We will recheck.

bob
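[For reference, the sort of commands that check this; the OST UUID matches the one in the thread, but /lustre is an example client mount point.]

  # Per-OST space usage as seen from a client.
  lfs df -h /lustre

  # List files that have at least one object on the suspect OST.
  lfs find --obd umt3-OST001d_UUID /lustre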
OK, I tried this morning to follow the information/procedures from section 23.3.9 of the user manual, and succeeded in confusing myself admirably.

I took Lustre completely offline, then first checked the LAST_ID on a known, good OST. I found this kind of thing:

  # od -Ax -td4 /mnt/ost/last_rcvd | more
  (For ost11, the 8c index value is 28.)

  debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/sdb ; od -Ax -td8 /tmp/LAST_ID
  debugfs 1.41.10.sun2 (24-Feb-2010)
  /dev/sdb: catastrophic mode - not reading inode or group bitmaps
  000000                68321
  000008

  (gdb) p /x 68321
  $1 = 0x10ae1
  (gdb)

  [root@umdist03 ~]# cat /tmp/LAST_ID.asc
  0000000: e10a 0100 0000 0000  á.......

From this I can see how to edit the LAST_ID.asc for use in the repair procedure. All other information was consistent; the /tmp/objects.sdb listing ended with this LAST_ID object.

Now, we move on to the "bad" OST. First, I did an "lfs find" yesterday on just this OST and came up with some 8000 files before it seemed to cease output. So, I expected to see SOMETHING on the physical disk. But, in fact, the /tmp/objects.sdc listing showed no content whatsoever? Just blank lines, and a direct look at the ls output confirmed that. And so, the confusion began.

LAST_ID is, indeed, zero.

I am checking with my co-conspirator in this. It is _possible_ that this OST ID was re-used after a machine was dropped from our system due to non-recoverable disk/system errors. So, in my mind, that means it is possible that the current OST really IS empty of content. Is that really what is meant by getting this kind of output from the ls command for all 9 or 10 of these directories?

  /mnt/ost/O/0/d8:
  total 0

  /mnt/ost/O/0/d9:
  total 0

Assuming the disk really is empty, then, and LAST_ID really is zero, shall I leave it at zero and follow the recommendation of page 23-14, i.e., just shut down again, delete the lov_objid file on the MDS, and restart the system? Certainly the value at the correct index (29) is definitely hosed:

  # od -Ax -td8 /mnt/mdt/lov_objid
  (snip)
  0000d0        292648        346413
  0000e0         68225 -7137254917378053186
  0000f0         59064         59607
  000100         59227         59414

Thanks,
bob
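[A note on the offset arithmetic above: lov_objid holds one little-endian 64-bit value per OST index, so OST001d (index 29, hex 0x1d) lives at byte offset 29 * 8 = 232 (0xe8), i.e. the second value on the "0000e0" line of the od output. Device paths below are examples.]

  # Dump just the index-29 entry of lov_objid (MDT mounted as ldiskfs).
  od -Ax -td8 -j 232 -N 8 /mnt/mdt/lov_objid

  # Extract and decode an OST's LAST_ID the same way as above.
  debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/sdc
  od -Ax -td8 /tmp/LAST_ID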
On 2010-09-10, at 08:21, Bob Ball wrote:
> Now, we move on to the "bad" OST. First, I did an "lfs find" yesterday on
> just this OST and came up with some 8000 files before it seemed to cease
> output. So, I expected to see SOMETHING on the physical disk. But, in fact,
> the /tmp/objects.sdc listing showed no content whatsoever? Just blank
> lines, and a direct look at the ls output confirmed that. And so, the
> confusion began.
>
> LAST_ID is, indeed, zero.
>
> I am checking with my co-conspirator in this. It is _possible_ that this
> OST ID was re-used after a machine was dropped from our system due to
> non-recoverable disk/system errors. So, in my mind, that means it is
> possible that the current OST really IS empty of content. Is that really
> what is meant by getting this kind of output from the ls command for all
> 9 or 10 of these directories?
>
> /mnt/ost/O/0/d8:
> total 0
>
> /mnt/ost/O/0/d9:
> total 0

There should be 32 such directories.

> Assuming the disk really is empty, then, and LAST_ID really is zero, shall
> I leave it at zero and follow the recommendation of page 23-14, i.e., just
> shut down again, delete the lov_objid file on the MDS, and restart the
> system? Certainly the value at the correct index (29) is definitely hosed:
>
> # od -Ax -td8 /mnt/mdt/lov_objid
> (snip)
> 0000d0        292648        346413
> 0000e0         68225 -7137254917378053186
> 0000f0         59064         59607
> 000100         59227         59414

Yes, that is definitely hosed. Deleting the lov_objid file from the MDS and remounting the MDS should fix this value. You could also just binary edit the file and set this to 1.
Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
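[A sketch of one way to do the binary edit Andreas mentions, assuming OST001d is index 29 and therefore byte offset 29 * 8 = 232; the device path is an example, and Lustre must be stopped with the MDT mounted as ldiskfs. Back the file up first.]

  mount -t ldiskfs /dev/mdtdev /mnt/mdt
  cp -a /mnt/mdt/lov_objid /root/lov_objid.backup   # backup before touching anything

  # Write a little-endian 64-bit value of 1 into the slot for OST index 29.
  printf '\x01\x00\x00\x00\x00\x00\x00\x00' | \
      dd of=/mnt/mdt/lov_objid bs=1 seek=232 count=8 conv=notrunc

  od -Ax -td8 /mnt/mdt/lov_objid                    # verify index 29 now reads 1
  umount /mnt/mdt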
>> Assuming the disk really is empty, then, and LAST_ID really is zero,
>> shall I leave it at zero and follow the recommendation of page 23-14,
>> i.e., just shut down again, delete the lov_objid file on the MDS, and
>> restart the system? Certainly the value at the correct index (29) is
>> definitely hosed:
>>
>> # od -Ax -td8 /mnt/mdt/lov_objid
>> (snip)
>> 0000d0        292648        346413
>> 0000e0         68225 -7137254917378053186
>> 0000f0         59064         59607
>> 000100         59227         59414
>
> Yes, that is definitely hosed. Deleting the lov_objid file from the
> MDS and remounting the MDS should fix this value. You could also
> just binary edit the file and set this to 1.

Andreas, Bob, please be very, very careful with lov_objid. As I already wrote last week, I reproducibly get a hard kernel panic every time I test this: delete the file and then mount the MDT again. You can try it, but DO CREATE A BACKUP of this file, so that you can copy it back if something goes wrong.

Sorry, I don't have the time right now to work on the lov_objid-delete bug, not even time to write a suitable bug report :(

Cheers,
Bernd
I just made some random checks on the "lfs find" output for this OST from yesterday. Each file I checked was one lost when we had problems a few months back. The suggested "unlink" on these did not work in 1.8.3; it worked fine on a whole set yesterday with 1.8.4, but I obviously did not find them all. So, I am going to assume that this OST is completely empty.

I will make a backup of the lov_objid file, then see if I can do a binary edit using xxd, hopefully avoiding the kernel panic. Crossing my fingers. Will announce a short outage here to begin in 45 minutes from now.

bob

Bernd Schubert wrote:
> Andreas, Bob, please be very, very careful with lov_objid. As I already
> wrote last week, I reproducibly get a hard kernel panic every time I
> delete the file and then mount the MDT again. You can try it, but DO
> CREATE A BACKUP of this file, so that you can copy it back if something
> goes wrong.
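[For reference, an xxd round trip along the lines described here; the same offset assumption applies (index 29 at byte 0xe8), and paths are examples.]

  cp -a /mnt/mdt/lov_objid /root/lov_objid.backup   # backup first

  xxd /mnt/mdt/lov_objid /tmp/lov_objid.hex         # dump to an editable hex listing
  # Edit /tmp/lov_objid.hex so the eight bytes at offset 0x00e8 read
  # "0100 0000 0000 0000" (little-endian 1), then write the file back:
  xxd -r /tmp/lov_objid.hex /mnt/mdt/lov_objid

  od -Ax -td8 /mnt/mdt/lov_objid                    # confirm index 29 is now 1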
OK, this worked. I was able to rewrite the LAST_ID value stored in the lov_objid object to a value of 1, and when Lustre came up, the OST was back at "UP" (yay!).

However, there still seem to be problems with that OST, as "lfs find" comes up with files there that do not exist. I guess at some point we'll have to take a complete outage to fix the file system consistency. But in the meantime, thank you all for your help and advice.

I must say, though, I like this 1.8.4 version much better than 1.8.3. We were even able to migrate live between versions to do the upgrade, so no down-time was involved.

bob
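[As a pointer for the consistency pass mentioned above: the lfsck procedure in the manual reuses the mdsdb/ostdb files built with e2fsck, as sketched earlier in the thread, and then runs from a client. A rough outline, with example paths.]

  # Databases built with e2fsck -n --mdsdb/--ostdb while the targets are
  # unmounted (see the earlier sketch), then copied to a client node.
  lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb /lustre   # -n = report only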