Ms. Megan Larko
2008-Sep-29 15:34 UTC
[Lustre-discuss] MGS/MDS error: operation 101 on unconnected MGS
Greetings,

It's Monday (sigh). I lost one dual-core Opteron 275 of two on my OSS box over the weekend. The /var/log/messages contained many "bus error on processor" messages. So Monday I rebooted the OSS with only one dual-core CPU. The box came up just fine and I mounted the three Lustre OST disks I have on that box. (CentOS 5, 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux)

The problem is that now my MGS/MDS box cannot access/use the disks on that box. The single MDT volume mounts without error, but I see the following messages in the MGS/MDS /var/log/messages file:

Sep 29 10:02:40 mds1 kernel: Lustre: MDT crew3-MDT0000 now serving dev (crew3-MDT0000/be7a58cd-e259-823f-486b-e974551d7ad6) with recovery enabled
Sep 29 10:02:40 mds1 kernel: Lustre: Server crew3-MDT0000 on device /dev/md0 has started
Sep 29 10:02:40 mds1 kernel: Lustre: MDS crew3-MDT0000: crew3d1_UUID now active, resetting orphans
Sep 29 10:02:40 mds1 kernel: Lustre: Skipped 2 previous similar messages
Sep 29 10:03:29 mds1 kernel: LustreError: 26914:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Sep 29 10:03:29 mds1 kernel: LustreError: 26914:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81005986d450 x17040407/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 29 10:03:29 mds1 kernel: LustreError: 26915:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 501 on unconnected MGS
Sep 29 10:03:29 mds1 kernel: LustreError: 26915:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810055ffac50 x17040408/t0 o501-><?>@<?>:-1 lens 200/0 ref 0 fl Interpret:/0/0 rc -107/0

The messages do not repeat.

On the MGS/MDS I also have:

[root@mds1 ~]# cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
status: INACTIVE

lctl can successfully ping the OSS, and the OSTs appear correctly in lctl dl.
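(A note on the "(-107)" in the LustreError lines above, not from the thread itself: that is a negative Linux errno. 107 is ENOTCONN, i.e. the very same "Transport endpoint is not connected" the clients later report at mount time, which can be confirmed with a one-liner:)

```shell
# Decode errno 107: Lustre logs negative errno values, so rc -107 is -ENOTCONN.
python3 -c 'import errno, os; print(errno.errorcode[107], "->", os.strerror(107))'
```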
The peers list on the MGS/MDS (which is still successfully serving other disks) appears normal:

[root@mds1 ~]# cat /proc/sys/lnet/peers
nid                      refs state   max   rtr   min    tx   min queue
0@lo                        1  ~rtr     0     0     0     0     0     0
172.16.0.15@o2ib            1  ~rtr     8     8     8     8     7     0
172.18.1.1@o2ib             1  ~rtr     8     8     8     8  -527     0
172.18.1.2@o2ib             1  ~rtr     8     8     8     8  -261     0
172.18.0.9@o2ib             1  ~rtr     8     8     8     8     7     0
172.18.0.10@o2ib            1  ~rtr     8     8     8     8     6     0
172.18.0.11@o2ib            1  ~rtr     8     8     8     8  -239     0
172.18.0.12@o2ib            1  ~rtr     8     8     8     8    -2     0
172.18.0.13@o2ib            1  ~rtr     8     8     8     8    -4     0
172.18.0.14@o2ib            1  ~rtr     8     8     8     8    -4     0
172.18.0.15@o2ib            1  ~rtr     8     8     8     8   -42     0
172.18.0.16@o2ib            1  ~rtr     8     8     8     8     7     0

With this information I did a Google search on the error and found http://lustre.sev.net.ua/changeset/119/trunk/lustre. The page was timestamped 3/12/08 by author shadow with the info below:

trunk/lustre/ChangeLog r100 r119

Severity   : major
Frequency  : frequent on X2 node
Bugzilla   : 15010
Description: mdc_set_open_replay_data LBUG
Details    : Set replay data for requests that are eligible for replay.

Severity   : normal
Bugzilla   : 14321
Description: lustre_mgs: operation 101 on unconnected MGS
Details    : When the MGC is disconnected from the MGS long enough, the MGS will evict the MGC, and later the MGC cannot successfully connect to the MGS, producing a lot of error messages complaining that the MGS is not connected.

Severity   : major
Frequency  : on start mds
Bugzilla   : 14884

Okay. I am still running 2.6.18-53.1.13.el5_lustre.1.6.4.3smp. Is there a way to get the MGS/MDS to once again access the OSTs associated with the MDT? The OSS box looks perfectly fine (minus one CPU). All the errors appear on the MGS/MDS box. The Lustre disk will not mount on any of my clients; the message "mount.lustre: mount ic-mds1@o2ib:/crew3 at /crew3 failed: Transport endpoint is not connected" is all that occurs.

Suggestions and advice greatly appreciated. Do I just have to wait a long time to let the disk "find itself"?
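(An aside on reading that table: a peer whose minimum tx credit count has gone negative had messages queued waiting for send credits at some point. A small filter over the same /proc/sys/lnet/peers format, with the field position assumed from the listing above:)

```shell
# Print peers whose observed minimum tx credits went negative.
# In the listing above ("nid refs state max rtr min tx min queue"),
# the min-tx value is the 8th whitespace-separated field.
flag_starved_peers() {
    awk 'NR > 1 && $8 < 0 { print $1 }' "$1"
}

if [ -r /proc/sys/lnet/peers ]; then
    flag_starved_peers /proc/sys/lnet/peers
fi
```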
Using lctl device xx and activate did not help.

megan
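(For what it is worth, one sequence sometimes suggested when a target has lost its MGS connection is to unmount and remount the MDT so it re-registers and recovery can run. This is only a sketch under that assumption, not a verified fix; the device and mount point are the ones from this thread, and DRY_RUN is a name invented here that defaults to printing the steps rather than running them:)

```shell
# Dry-run by default: set DRY_RUN=0 to actually execute the steps.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

restart_mdt() {
    run umount /srv/lustre/mds/crew3-MDT0000
    run mount -t lustre /dev/md0 /srv/lustre/mds/crew3-MDT0000
    # recovery_status should leave INACTIVE once clients start reconnecting
    run cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
}

restart_mdt
```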
megan
2008-Sep-30 16:27 UTC
[Lustre-discuss] MGS/MDS error: operation 101 on unconnected MGS
More information: I tried rebooting my MGS/MDS to see if that would solve my inability to client-mount this disk. No, it did not.

However, I did try the following, which might offer some insight to a wise Lustre person. On the MGS/MDS I did the following:

[root@mds1 tmp.BKUP]# mount -vv -t lustre ic-mds1@o2ib:/crew3 /tmp.BKUP/crew3
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = ic-mds1@o2ib:/crew3
arg[5] = /tmp.BKUP/crew3
source = ic-mds1@o2ib:/crew3 (172.18.0.10@o2ib:/crew3), target = /tmp.BKUP/crew3
options = rw
mounting device 172.18.0.10@o2ib:/crew3 at /tmp.BKUP/crew3, flags=0 options=device=172.18.0.10@o2ib:/crew3
warning: 172.18.0.10@o2ib:/crew3: cannot resolve: No such file or directory
/sbin/mount.lustre: unable to set tunables for 172.18.0.10@o2ib:/crew3 (may cause reduced IO performance)

[root@mds1 tmp.BKUP]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              29G   20G  7.5G  73% /
/dev/sda1             145M   17M  121M  13% /boot
tmpfs                1006M     0 1006M   0% /dev/shm
/dev/sda5              42G  6.2G   34G  16% /srv/lustre_admin
/dev/METADATA1/LV1    204G  5.3G  190G   3% /srv/lustre/mds/crew2-MDT0000
/dev/sdf               58G  2.5G   52G   5% /srv/lustre/mds/crew8-MDT0000
/dev/md0              204G  4.4G  188G   3% /srv/lustre/mds/crew3-MDT0000
ic-mds1@o2ib:/crew3    19T   17T  1.8T  91% /tmp.BKUP/crew3

[root@mds1 tmp.BKUP]# ls /crew3
ls: /crew3: No such file or directory
[root@mds1 tmp.BKUP]# cd /tmp.BKUP/crew3
[root@mds1 crew3]# ls
data  users
[root@mds1 crew3]# ls users
asahoo  hongbo  kristi  larkoc  qiaox  roshan  tugrul  yluo
[root@mds1 crew3]# cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
status: INACTIVE

Whoa! The files are there; the disk is there. The recovery_status is still "INACTIVE". Can this be correct? The disk seems usable on the MGS/MDS (where there is no user access *at all*).
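(A side note: the single-target check above generalizes to every local MDT at once. A small helper under the Lustre 1.6-era /proc layout shown in this thread; the base-directory parameter is just for illustration:)

```shell
# Print recovery_status for each MDT under the given base directory
# (defaults to the Lustre 1.6 /proc location used in this thread).
show_recovery_status() {
    base="${1:-/proc/fs/lustre/mds}"
    for f in "$base"/*/recovery_status; do
        [ -r "$f" ] || continue
        echo "== $f"
        cat "$f"
    done
}

show_recovery_status
```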
So on a client I did something very similar:

[root@crew01 ~]# mount -vv -t lustre ic-mds1@o2ib:/crew3 /crew3
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = ic-mds1@o2ib:/crew3
arg[5] = /crew3
source = ic-mds1@o2ib:/crew3 (172.18.0.10@o2ib:/crew3), target = /crew3
options = rw
mounting device 172.18.0.10@o2ib:/crew3 at /crew3, flags=0 options=device=172.18.0.10@o2ib:/crew3
warning: 172.18.0.10@o2ib:/crew3: cannot resolve: No such file or directory
/sbin/mount.lustre: unable to set tunables for 172.18.0.10@o2ib:/crew3 (may cause reduced IO performance)
mount.lustre: mount ic-mds1@o2ib:/crew3 at /crew3 failed: Transport endpoint is not connected

Still the same error message, "Transport endpoint is not connected".

How can other disks on the same MGS/MDS and the same IB switch work on the client, while this one particular disk mounts only on the MGS/MDS and not on any client?

What do I need to do to fix it?

Thank you,
megan
Wojciech Turek
2008-Sep-30 17:07 UTC
[Lustre-discuss] MGS/MDS error: operation 101 on unconnected MGS
Hi,

Can you please run the following command from the client server (crew01) and paste the output here?

lctl ping ic-mds1@o2ib

Cheers,
Wojciech

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
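(A scripted form of that check, in case it helps: the NID is the one from this thread, and `lctl` from the Lustre utilities is assumed to be on PATH. `lctl ping` prints the peer's NIDs and exits zero when the LNET path is up:)

```shell
# Ping an MGS NID over LNET and print a one-line verdict.
check_mgs_lnet() {
    nid="${1:-ic-mds1@o2ib}"
    if lctl ping "$nid" >/dev/null 2>&1; then
        echo "LNET to $nid: ok"
    else
        echo "LNET to $nid: unreachable"
    fi
}
# usage: check_mgs_lnet ic-mds1@o2ib
```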