Ms. Megan Larko
2008-Sep-29 15:34 UTC
[Lustre-discuss] MGS/MDS error: operation 101 on unconnected MGS
Greetings,

It's Monday (sigh). I lost one dual-core Opteron 275 of two on my OSS box over the weekend. The /var/log/messages contained many "bus error on processor" messages. So Monday I rebooted the OSS with only one dual-core CPU. The box came up just fine and I mounted the three Lustre OST disks I have on that box. (CentOS 5, 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux)

The problem is that now my MGS/MDS box cannot access/use the disks on that box. The single MDT volume mounts without error, but I see the following messages in the MGS/MDS /var/log/messages file:

Sep 29 10:02:40 mds1 kernel: Lustre: MDT crew3-MDT0000 now serving dev (crew3-MDT0000/be7a58cd-e259-823f-486b-e974551d7ad6) with recovery enabled
Sep 29 10:02:40 mds1 kernel: Lustre: Server crew3-MDT0000 on device /dev/md0 has started
Sep 29 10:02:40 mds1 kernel: Lustre: MDS crew3-MDT0000: crew3d1_UUID now active, resetting orphans
Sep 29 10:02:40 mds1 kernel: Lustre: Skipped 2 previous similar messages
Sep 29 10:03:29 mds1 kernel: LustreError: 26914:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Sep 29 10:03:29 mds1 kernel: LustreError: 26914:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81005986d450 x17040407/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 29 10:03:29 mds1 kernel: LustreError: 26915:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 501 on unconnected MGS
Sep 29 10:03:29 mds1 kernel: LustreError: 26915:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810055ffac50 x17040408/t0 o501-><?>@<?>:-1 lens 200/0 ref 0 fl Interpret:/0/0 rc -107/0

The messages do not repeat.

On the MGS/MDS I also have:

[root@mds1 ~]# cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
status: INACTIVE

lctl can successfully ping the OSS, and the OSTs appear correctly in lctl dl.
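(A note on the "(-107)" in the LustreError lines above, not from the thread itself: that is a negative Linux errno. 107 is ENOTCONN, i.e. the very same "Transport endpoint is not connected" the clients later report at mount time, which can be confirmed with a one-liner:)

```shell
# Decode errno 107: Lustre logs negative errno values, so rc -107 is -ENOTCONN.
python3 -c 'import errno, os; print(errno.errorcode[107], "->", os.strerror(107))'
```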
The peers list on the MGS/MDS (which is still successfully serving other disks) appears normal:

[root@mds1 ~]# cat /proc/sys/lnet/peers
nid                      refs state   max   rtr   min    tx   min queue
0@lo                        1  ~rtr     0     0     0     0     0     0
172.16.0.15@o2ib            1  ~rtr     8     8     8     8     7     0
172.18.1.1@o2ib             1  ~rtr     8     8     8     8  -527     0
172.18.1.2@o2ib             1  ~rtr     8     8     8     8  -261     0
172.18.0.9@o2ib             1  ~rtr     8     8     8     8     7     0
172.18.0.10@o2ib            1  ~rtr     8     8     8     8     6     0
172.18.0.11@o2ib            1  ~rtr     8     8     8     8  -239     0
172.18.0.12@o2ib            1  ~rtr     8     8     8     8    -2     0
172.18.0.13@o2ib            1  ~rtr     8     8     8     8    -4     0
172.18.0.14@o2ib            1  ~rtr     8     8     8     8    -4     0
172.18.0.15@o2ib            1  ~rtr     8     8     8     8   -42     0
172.18.0.16@o2ib            1  ~rtr     8     8     8     8     7     0

With this information I did a Google search on the error and found http://lustre.sev.net.ua/changeset/119/trunk/lustre. The page was timestamped 3/12/08 by author shadow with the info below:

trunk/lustre/ChangeLog r100 r119

Severity   : major
Frequency  : frequent on X2 node
Bugzilla   : 15010
Description: mdc_set_open_replay_data LBUG
Details    : Set replay data for requests that are eligible for replay.

Severity   : normal
Bugzilla   : 14321
Description: lustre_mgs: operation 101 on unconnected MGS
Details    : When the MGC is disconnected from the MGS long enough, the MGS will evict the MGC, and later the MGC cannot successfully connect to the MGS, producing a lot of error messages complaining that the MGS is not connected.

Severity   : major
Frequency  : on start mds
Bugzilla   : 14884

Okay. I am still running 2.6.18-53.1.13.el5_lustre.1.6.4.3smp. Is there a way to get the MGS/MDS to once again access the OSTs associated with the MDT? The OSS box looks perfectly fine (minus one CPU). All the errors appear on the MGS/MDS box. The Lustre disk will not mount on any of my clients; the message "mount.lustre: mount ic-mds1@o2ib:/crew3 at /crew3 failed: Transport endpoint is not connected" is all that occurs.

Suggestions and advice greatly appreciated. Do I just have to wait a long time to let the disk "find itself"?
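(An aside on reading that table: a peer whose minimum tx credit count has gone negative had messages queued waiting for send credits at some point. A small filter over the same /proc/sys/lnet/peers format, with the field position assumed from the listing above:)

```shell
# Print peers whose observed minimum tx credits went negative.
# In the listing above ("nid refs state max rtr min tx min queue"),
# the min-tx value is the 8th whitespace-separated field.
flag_starved_peers() {
    awk 'NR > 1 && $8 < 0 { print $1 }' "$1"
}

if [ -r /proc/sys/lnet/peers ]; then
    flag_starved_peers /proc/sys/lnet/peers
fi
```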
Using lctl device xx and activate did not help.

megan
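(For what it is worth, one sequence sometimes suggested when a target has lost its MGS connection is to unmount and remount the MDT so it re-registers and recovery can run. This is only a sketch under that assumption, not a verified fix; the device and mount point are the ones from this thread, and DRY_RUN is a name invented here that defaults to printing the steps rather than running them:)

```shell
# Dry-run by default: set DRY_RUN=0 to actually execute the steps.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

restart_mdt() {
    run umount /srv/lustre/mds/crew3-MDT0000
    run mount -t lustre /dev/md0 /srv/lustre/mds/crew3-MDT0000
    # recovery_status should leave INACTIVE once clients start reconnecting
    run cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
}

restart_mdt
```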
megan
2008-Sep-30 16:27 UTC
[Lustre-discuss] MGS/MDS error: operation 101 on unconnected MGS
More information: I tried rebooting my MGS/MDS to see if that would solve my inability to client-mount this disk. No, it did not.

However, I did try the following, which might offer some insight to a wise Lustre person. On the MGS/MDS I did the following:

[root@mds1 tmp.BKUP]# mount -vv -t lustre ic-mds1@o2ib:/crew3 /tmp.BKUP/crew3
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = ic-mds1@o2ib:/crew3
arg[5] = /tmp.BKUP/crew3
source = ic-mds1@o2ib:/crew3 (172.18.0.10@o2ib:/crew3), target = /tmp.BKUP/crew3
options = rw
mounting device 172.18.0.10@o2ib:/crew3 at /tmp.BKUP/crew3, flags=0 options=device=172.18.0.10@o2ib:/crew3
warning: 172.18.0.10@o2ib:/crew3: cannot resolve: No such file or directory
/sbin/mount.lustre: unable to set tunables for 172.18.0.10@o2ib:/crew3 (may cause reduced IO performance)

[root@mds1 tmp.BKUP]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              29G   20G  7.5G  73% /
/dev/sda1             145M   17M  121M  13% /boot
tmpfs                1006M     0 1006M   0% /dev/shm
/dev/sda5              42G  6.2G   34G  16% /srv/lustre_admin
/dev/METADATA1/LV1    204G  5.3G  190G   3% /srv/lustre/mds/crew2-MDT0000
/dev/sdf               58G  2.5G   52G   5% /srv/lustre/mds/crew8-MDT0000
/dev/md0              204G  4.4G  188G   3% /srv/lustre/mds/crew3-MDT0000
ic-mds1@o2ib:/crew3    19T   17T  1.8T  91% /tmp.BKUP/crew3

[root@mds1 tmp.BKUP]# ls /crew3
ls: /crew3: No such file or directory
[root@mds1 tmp.BKUP]# cd /tmp.BKUP/crew3
[root@mds1 crew3]# ls
data  users
[root@mds1 crew3]# ls users
asahoo  hongbo  kristi  larkoc  qiaox  roshan  tugrul  yluo
[root@mds1 crew3]# cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
status: INACTIVE

Whoa! The files are there; the disk is there. The recovery_status is still "INACTIVE". Can this be correct? The disk seems usable on the MGS/MDS (where there is no user access *at all*).
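(A side note: the single-target check above generalizes to every local MDT at once. A small helper under the Lustre 1.6-era /proc layout shown in this thread; the base-directory parameter is just for illustration:)

```shell
# Print recovery_status for each MDT under the given base directory
# (defaults to the Lustre 1.6 /proc location used in this thread).
show_recovery_status() {
    base="${1:-/proc/fs/lustre/mds}"
    for f in "$base"/*/recovery_status; do
        [ -r "$f" ] || continue
        echo "== $f"
        cat "$f"
    done
}

show_recovery_status
```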
So on a client I did something very similar:

[root@crew01 ~]# mount -vv -t lustre ic-mds1@o2ib:/crew3 /crew3
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = ic-mds1@o2ib:/crew3
arg[5] = /crew3
source = ic-mds1@o2ib:/crew3 (172.18.0.10@o2ib:/crew3), target = /crew3
options = rw
mounting device 172.18.0.10@o2ib:/crew3 at /crew3, flags=0 options=device=172.18.0.10@o2ib:/crew3
warning: 172.18.0.10@o2ib:/crew3: cannot resolve: No such file or directory
/sbin/mount.lustre: unable to set tunables for 172.18.0.10@o2ib:/crew3 (may cause reduced IO performance)
mount.lustre: mount ic-mds1@o2ib:/crew3 at /crew3 failed: Transport endpoint is not connected

Still the same error message, "Transport endpoint is not connected".

How can other disks on the same MGS/MDS and the same IB switch work on the client, while this one particular disk mounts only on the MGS/MDS and not on any client?

What do I need to do to fix it?

Thank you,
megan
Wojciech Turek
2008-Sep-30 17:07 UTC
[Lustre-discuss] MGS/MDS error: operation 101 on unconnected MGS
Hi,

Can you please run the following command from the client server (crew01) and paste the output here?

lctl ping ic-mds1@o2ib

Cheers,
Wojciech

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
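(A scripted form of that check, in case it helps: the NID is the one from this thread, and `lctl` from the Lustre utilities is assumed to be on PATH. `lctl ping` prints the peer's NIDs and exits zero when the LNET path is up:)

```shell
# Ping an MGS NID over LNET and print a one-line verdict.
check_mgs_lnet() {
    nid="${1:-ic-mds1@o2ib}"
    if lctl ping "$nid" >/dev/null 2>&1; then
        echo "LNET to $nid: ok"
    else
        echo "LNET to $nid: unreachable"
    fi
}
# usage: check_mgs_lnet ic-mds1@o2ib
```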