thr3ads.net - Lustre discuss - [Lustre-discuss] Re-using OST and MDT names [Aug 2008]

Ms. Megan Larko

2008-Aug-14 16:55 UTC

[Lustre-discuss] Re-using OST and MDT names

Hello,

As part of my continuing effort to benchmark Lustre and ascertain what may be
best suited to our needs here, I have re-created, at the LSI 8888ELP
card level, some of the arrays from my earlier benchmark posts.  The
card is now presenting /dev/sdf (998999 MB) and /dev/sdg (6992995 MB),
each with a 128 kB stripe size, to my OSS.  Both sdf and sdg
formatted fine with Lustre and mounted without issue on the OSS.
Recycling the MDTs on the MGS seems to have been a problem.  When I
mounted the MDT on the MGS after mounting the new OSTs, the mounts
completed without error, but the bonnie benchmark test, run as before,
hung every time.

Sample of errors in MGS file /var/log/messages:
Aug 13 12:39:30 mds1 kernel: Lustre: crew5-OST0001-osc: Connection to service crew5-OST0001 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Aug 13 12:39:30 mds1 kernel: LustreError: 167-0: This client was evicted by crew5-OST0001; in progress operations using this service will fail.
Aug 13 12:39:30 mds1 kernel: Lustre: crew5-OST0001-osc: Connection restored to service crew5-OST0001 using nid 172.18.0.15@o2ib.
Aug 13 12:39:30 mds1 kernel: Lustre: MDS crew5-MDT0000: crew5-OST0001_UUID now active, resetting orphans
Aug 13 12:42:42 mds1 kernel: Lustre: 3406:0:(ldlm_lib.c:519:target_handle_reconnect()) crew5-MDT0000: 50b043bb-0e8c-7a5b-b0fe-6bdb67d21e0b reconnecting
Aug 13 12:42:42 mds1 kernel: Lustre: 3406:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 24 previous similar messages
Aug 13 12:42:42 mds1 kernel: Lustre: 3406:0:(ldlm_lib.c:747:target_handle_connect()) crew5-MDT0000: refuse reconnection from 50b043bb-0e8c-7a5b-b0fe-6bdb67d21e0b@172.18.0.14@o2ib to 0xffff81006994d000; still busy with 2 active RPCs
Aug 13 12:42:42 mds1 kernel: Lustre: 3406:0:(ldlm_lib.c:747:target_handle_connect()) Skipped 24 previous similar messages
Aug 13 12:42:42 mds1 kernel: LustreError: 3406:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16)  req@ffff81007b621400 x600107/t0 o38->50b043bb-0e8c-7a5b-b0fe-6bdb67d21e0b@NET_0x50000ac12000e_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
Aug 13 12:42:42 mds1 kernel: LustreError: 3406:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 24 previous similar messages
Aug 13 12:43:40 mds1 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.15@o2ib. The ost_connect operation failed with -19
Aug 13 12:43:40 mds1 kernel: LustreError: Skipped 7 previous similar messages
Aug 13 12:47:50 mds1 kernel: Lustre: crew5-OST0001-osc: Connection to service crew5-OST0001 via nid 172.18.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.

Sample of errors on OSS file /var/log/messages:
Aug 13 12:39:30 oss4 kernel: Lustre: crew5-OST0001: received MDS connection from 172.18.0.10@o2ib
Aug 13 12:43:57 oss4 kernel: Lustre: crew5-OST0001: haven't heard from client crew5-mdtlov_UUID (at 172.18.0.10@o2ib) in 267 seconds. I think it's dead, and I am evicting it.
Aug 13 12:46:27 oss4 kernel: LustreError: 137-5: UUID 'crew5-OST0000_UUID' is not available  for connect (no target)
Aug 13 12:46:27 oss4 kernel: LustreError: Skipped 51 previous similar messages
Aug 13 12:46:27 oss4 kernel: LustreError: 4151:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-19)  req@ffff81042fa56800 x600171/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Aug 13 12:46:27 oss4 kernel: LustreError: 4151:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 52 previous similar messages
Aug 13 12:47:50 oss4 kernel: Lustre: crew5-OST0001: received MDS connection from 172.18.0.10@o2ib

All lctl pings were successful.  Additionally, files on the Lustre disks
of our live system, which uses the same MGS, were all fine, with no
errors in the log file.
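
For anyone reading the log samples above: the negative numbers in the "rc" and "failed with" fields are Linux errno values with the sign flipped. The following is only a small lookup sketch for the codes that appear in this thread; the function name lustre_rc_name is made up for illustration and is not part of Lustre.

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): map a return code as printed in
# the kernel log (e.g. "-16") to its Linux errno name.
lustre_rc_name() {
    case "${1#-}" in             # strip the leading minus sign, if any
        16) echo "EBUSY" ;;      # "still busy with 2 active RPCs"
        19) echo "ENODEV" ;;     # "not available for connect (no target)"
        98) echo "EADDRINUSE" ;; # mount.lustre: "Address already in use"
        *)  echo "unknown" ;;
    esac
}

# Decode the two codes seen in the MDS and OSS logs.
for rc in -16 -19; do
    echo "rc $rc = $(lustre_rc_name "$rc")"
done
```

Read this way, the -19 (ENODEV) on the OSS lines up with the "crew5-OST0000_UUID is not available for connect (no target)" message: the MDS appears to be asking for an OST that no longer exists on that server.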

I thought that maybe changing the stripe size and reformatting the
OST without reformatting the MDT was the problem.  So I unmounted the OST
and MDT and reformatted the MDT on the MGS.  All okay.  The OST
remounted without error.  The MDT on the MGS will not remount:

[root@mds1 ~]# mount -t lustre /dev/sdg1 /srv/lustre/mds/crew5-MDT0000
mount.lustre: mount /dev/sdg1 at /srv/lustre/mds/crew5-MDT0000 failed:
Address already in use
The target service's index is already in use. (/dev/sdg1)

Again, the live systems on the MGS are still fine.  A web search for
the error suggested I try "tunefs.lustre --reformat --index=0
--writeconf=/dev/sdg1", but I was unable to find a syntax for that
command that would run for me (I tried various index= values and adding
/dev/sdg1 at the end of the line, but it failed each time and only
reprinted the help text without indicating which part of what I typed
was unparseable).
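
A possible explanation for the help text being reprinted, offered as a sketch and not a verified fix: in the Lustre 1.6 tools, --reformat is documented as an mkfs.lustre option, not a tunefs.lustre one, and --writeconf is a plain flag that takes no value, with the device given as the last argument. The script below only prints the commands for the usual "regenerate the configuration logs" sequence. The run wrapper, the variable names and the /dev/sdX placeholder are assumptions for illustration, and nothing in this thread confirms that writeconf clears the "index is already in use" state on 1.6.4.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
MDT_DEV=/dev/sdg1      # MDT device on mds1, as in the failed mount above
OST_DEV=/dev/sdX       # placeholder: the real OST device on the OSS is not stated exactly

run() { echo "+ $*"; } # hypothetical wrapper; it echoes the command only

writeconf_plan() {
    # Precondition: every client and target of this file system is unmounted.
    run tunefs.lustre --writeconf "$MDT_DEV"   # on the MDS
    run tunefs.lustre --writeconf "$OST_DEV"   # on the OSS, once per OST
    # Remount the MDT first, then the OSTs, so the logs are regenerated in order.
    run mount -t lustre "$MDT_DEV" /srv/lustre/mds/crew5-MDT0000
}

writeconf_plan
```

Printing the plan first makes it easy to review the order of operations before touching an MGS that also serves live file systems.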

My current thought is that a stripesize of 128kB from the LSI 8888ELP
card is not testable on my Lustre 1.6.4 system.  This does not seem to
be an accurate statement from what I have read of Lustre but seems to
be what is occurring on my systems.  I will test one more time back at
64kB stripesize.

megan

megan

2008-Aug-15 20:08 UTC

[Lustre-discuss] Re-using OST and MDT names

Replying to myself here:
I did manage to get my disk mounted to perform a new benchmark test.
I could not figure out a way around the "mount.lustre: mount /dev/sdg1 at
/srv/lustre/mds/crew5-MDT0000 failed: Address already in use; The target
service's index is already in use. (/dev/sdg1)" error.
Even rebooting both the OSS and MDS did not help.

So, since this was just a hardware test with a different stripe size
setting on an LSI 8888ELP RAID card (128 kB in place of the default 64 kB),
I re-created both the OST and the MDT using a new, unique fsname and
all of the same hardware.
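
The workaround described above can be sketched roughly as follows. The exact commands used were not posted, so this is an assumption-laden dry run: the name bench128 is made up, /dev/sdX stands in for the OST device, and the MGS NID is inferred from the "received MDS connection from" line in the OSS log. The script only prints mkfs.lustre command lines, after checking that the new name fits Lustre's 8-character fsname limit.

```shell
#!/bin/sh
# Dry-run sketch: prints mkfs.lustre commands for a fresh file system name.
MGS_NID="172.18.0.10@o2ib"   # assumed: NID of mds1 as seen in the OSS log

# Lustre file system names must be non-empty and at most 8 characters long.
valid_fsname() {
    [ -n "$1" ] && [ "${#1}" -le 8 ]
}

format_plan() {
    fsname=$1; mdt_dev=$2; ost_dev=$3
    valid_fsname "$fsname" || {
        echo "fsname '$fsname' must be 1-8 characters" >&2
        return 1
    }
    # Add --reformat to each command if the device already holds a Lustre target.
    echo "mkfs.lustre --fsname=$fsname --mdt --mgsnode=$MGS_NID $mdt_dev"
    echo "mkfs.lustre --fsname=$fsname --ost --mgsnode=$MGS_NID $ost_dev"
}

# "bench128" is a hypothetical name; /dev/sdX is a placeholder OST device.
format_plan bench128 /dev/sdg1 /dev/sdX
```

Because the MGS has never seen the new fsname, it has no stale index registered for that name's MDT0000, which is presumably why this route worked where reusing crew5 did not.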

This worked like a charm.

....someday I will have to figure out what to do with the "cast-off"
MDT names which I apparently may no longer use...

Any comments, observations, or suggestions are appreciated.

Later,
megan


