Hi Again,
This is a follow-on to thread "One Lustre Client lost One Lustre
Disk".
I had lost one of two Lustre disks after a client reboot. The disk
itself seemed fine; it just would not mount. This occurred on two
clients. I thought that I needed to update the Lustre disk
information on the MGS, because I had successfully moved the
non-MGS disk a short while ago and the restore procedure used a
tunefs.lustre --writeconf..... command to update the disk's configuration.
I documented this in the "One Lustre Client lost One Lustre
Disk--solved" email to the Lustre list.
Well, perhaps not quite so "solved". My users have noticed that some
files on the remounted Lustre disk are inaccessible. What is more
unusual is that the set of inaccessible files is not consistent: it
could be one file now and a different file 10 minutes later.
To start, the system is CentOS 5.1 running 2.6.18-53.1.13.el5 on the
clients and 2.6.18-53.1.13.el5_lustre.1.6.4.3smp on the MGS/MDS and
OSS. There are no errors in /var/log/messages on either the OSS or
the MGS/MDS; only my routine ntpd timestamps appear there.
On the Lustre clients, I receive the following errors in
/var/log/messages when I access the disk on which I ran tunefs.lustre
yesterday:
Jul 17 09:56:58 cn2 kernel: LustreError: 13910:0:(lov_ea.c:228:lsm_unpackmd_plain()) OST index 0 missing
Jul 17 09:56:58 cn2 kernel: LustreError: 13910:0:(lov_ea.c:228:lsm_unpackmd_plain()) Skipped 1 previous similar message
Jul 17 09:56:58 cn2 kernel: Lustre: 13910:0:(lov_pack.c:47:lov_dump_lmm_v1()) objid 0x36716c8, magic 0x0bd10bd0, pattern 0x1
Jul 17 09:56:58 cn2 kernel: Lustre: 13910:0:(lov_pack.c:50:lov_dump_lmm_v1()) stripe_size 1048576, stripe_count 1
Jul 17 09:56:58 cn2 kernel: Lustre: 13910:0:(lov_pack.c:56:lov_dump_lmm_v1()) stripe 0 idx 0 subobj 0x0/0x78ac47
Jul 17 09:57:00 cn2 kernel: Lustre: 3613:0:(lov_pack.c:47:lov_dump_lmm_v1()) objid 0x36716c8, magic 0x0bd10bd0, pattern 0x1
Jul 17 09:57:00 cn2 kernel: Lustre: 3613:0:(lov_pack.c:50:lov_dump_lmm_v1()) stripe_size 1048576, stripe_count 1
Jul 17 09:57:00 cn2 kernel: Lustre: 3613:0:(lov_pack.c:56:lov_dump_lmm_v1()) stripe 0 idx 0 subobj 0x0/0x78ac47
Jul 17 09:57:00 cn2 kernel: Lustre: 13912:0:(lov_pack.c:47:lov_dump_lmm_v1()) objid 0x36716c8, magic 0x0bd10bd0, pattern 0x1
Jul 17 09:57:00 cn2 kernel: Lustre: 13912:0:(lov_pack.c:50:lov_dump_lmm_v1()) stripe_size 1048576, stripe_count 1
Jul 17 09:57:00 cn2 kernel: Lustre: 13912:0:(lov_pack.c:56:lov_dump_lmm_v1()) stripe 0 idx 0 subobj 0x0/0x78ac47
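For anyone comparing similar logs, here is a minimal Python sketch that tallies the messages quoted above. It assumes only the message shapes shown in this email; the function name summarize and the regular expressions are my own illustration, not part of any Lustre tool. The two hex values after "subobj" are kept as an opaque pair rather than interpreted.

```python
import re
from collections import Counter

# Patterns derived only from the message shapes quoted in this email.
MISSING_RE = re.compile(r"OST index (\d+) missing")
OBJID_RE = re.compile(r"objid (0x[0-9a-fA-F]+)")
STRIPE_RE = re.compile(
    r"stripe (\d+) idx (\d+) subobj (0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)"
)


def summarize(log_text):
    """Count missing-OST reports, objids, and stripe entries in log_text.

    Hypothetical helper: it only pattern-matches text and does not
    talk to Lustre in any way.
    """
    # Collapse all whitespace so messages wrapped by a mail client still match.
    flat = " ".join(log_text.split())
    missing = Counter(int(idx) for idx in MISSING_RE.findall(flat))
    objids = Counter(int(value, 16) for value in OBJID_RE.findall(flat))
    stripes = Counter(
        (int(stripe), int(idx), int(first, 16), int(second, 16))
        for stripe, idx, first, second in STRIPE_RE.findall(flat)
    )
    return {"missing_ost_indexes": missing, "objids": objids, "stripes": stripes}
```

Run against the excerpt above, this would show whether every dump refers to the same objid and the same OST index, which is what the repeated lines suggest.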
A Google search showed the error message "OST index 0 missing" to be
related to a missing piece of OST hardware in a Lustre disk. That is
not the case here. Checking the RAID array on the OSS for hardware
errors shows none.
I am attaching two screenshots (abrfc.png and ncdump.png) which show
a file that is not present and then is present a few minutes later,
while a different file is not accessible. This seems to be
affecting only a small number of the total files on the Lustre disk,
particularly text files.
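To record which files are inaccessible at a given moment, something like the following Python sketch could be run periodically on a client. The function names probe and changed are hypothetical, and the sketch assumes that an inaccessible file shows up as an OSError from os.stat, which I have not verified for this particular failure.

```python
import os


def probe(paths):
    """Map each path to None if os.stat succeeds, else to the error text."""
    result = {}
    for path in paths:
        try:
            os.stat(path)
            result[path] = None
        except OSError as exc:
            # strerror can be None for some errors, so fall back to str().
            result[path] = exc.strerror or str(exc)
    return result


def changed(before, after):
    """Return the sorted paths whose status differs between two probes."""
    return sorted(path for path in after if before.get(path) != after[path])
```

Calling probe on the same list of files a few minutes apart and passing both results to changed would list the files whose accessibility flipped in between.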
Is this "OST index 0 missing" situation fixable? If yes, how?
Should I mount the Lustre disk read-only for now?
Any advice is genuinely appreciated.
Megan Larko
-------------- next part --------------
A non-text attachment was scrubbed...
Name: abrfc.png
Type: image/png
Size: 1083331 bytes
Desc: not available
Url :
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090717/55f9d97e/attachment-0002.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ncdump.png
Type: image/png
Size: 973329 bytes
Desc: not available
Url :
http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090717/55f9d97e/attachment-0003.png