Samuel Aparicio
2011-Mar-21 03:41 UTC
[Lustre-discuss] persistent client re-connect failure
I am stuck with the following issue on a client attached to a Lustre system. We are running Lustre 1.8.5. Somehow connectivity to the OST failed at some point and the mount hung. After unmounting and re-mounting, the client attempts to reconnect. lctl ping shows the client to be connected, and a normal ping to the OSS/MGS servers shows connectivity.

Remounting the filesystem results in only some files being visible. The kernel messages are as follows:

---------
Lustre: setting import lustre-OST0003_UUID INACTIVE by administrator request
Lustre: lustre-OST0003-osc-ffff8110238c7400.osc: set parameter active=0
Lustre: Skipped 3 previous similar messages
LustreError: 14114:0:(lov_obd.c:315:lov_connect_obd()) not connecting OSC ^\; administratively disabled
Lustre: Client lustre-client has started
LustreError: 14207:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
LustreError: 14207:0:(file.c:995:ll_glimpse_size()) Skipped 1 previous similar message
LustreError: 14207:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
LustreError: 14686:0:(file.c:995:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Lustre: 22218:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1363662012007464 sent from lustre-OST0000-osc-ffff8110238c7400 to NID 10.9.89.21@tcp 16s ago has timed out (16s prior to deadline).
  req@ffff810459ce4c00 x1363662012007464/t0 o8->lustre-OST0000_UUID@10.9.89.21@tcp:28/4 lens 368/584 e 0 to 1 dl 1300678232 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 22218:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 182 previous similar messages
Lustre: 22219:0:(import.c:517:import_select_connection()) lustre-OST0000-osc-ffff8110238c7400: tried all connections, increasing latency to 18s
Lustre: 22219:0:(import.c:517:import_select_connection()) Skipped 203 previous similar messages
------------

An ls of the filesystem shows:

drwxr-xr-x 4 amcpherson users 4096 Mar 19 10:38 amcpherson
?--------- ? ? ? ? ? compute-2-0-testwrite
?--------- ? ? ? ? ? hello

----------

Other clients on the system are able to mount and see the files perfectly well.

Can anyone help with what the errors above imply? A simple network connectivity issue does not seem to be the case here, yet the client's attempts to reconnect to the OST fail.

Professor Samuel Aparicio BM BCh PhD FRCPath
Nan and Lorraine Robertson Chair UBC/BC Cancer Agency
675 West 10th, Vancouver V5Z 1L3, Canada.
office: +1 604 675 8200 lab website http://molonc.bccrc.ca

PLEASE SUPPORT MY FUNDRAISING FOR THE RIDE TO SEATTLE AND THE WEEKEND TO END WOMENS CANCERS. YOU CAN DONATE AT THE LINKS BELOW
Ride to Seattle Fundraiser
Weekend to End Womens Cancers
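The `rc -5` in the `ll_glimpse_size()` messages is a negated Linux errno value, which the same line spells out as `-EIO` (an I/O error). A minimal sketch for pulling such codes out of a saved log and naming them, assuming the kernel messages have been copied into a text string; `decode_rcs` and its regex are illustrative helpers written against the lines above, not part of any Lustre tool, and the names come from Python's standard `errno` module on the host running the script:

```python
import errno
import os
import re

# Matches negative "rc -N" return codes in Lustre console messages.
# Illustrative pattern written against the log lines quoted above.
RC_PATTERN = re.compile(r"\brc (-\d+)")

def decode_rcs(log_text):
    """Return (rc, errno_name, description) for each negative 'rc -N' found."""
    results = []
    for match in RC_PATTERN.finditer(log_text):
        rc = int(match.group(1))
        code = -rc  # Lustre reports errors as negated errno values
        name = errno.errorcode.get(code, "UNKNOWN")
        results.append((rc, name, os.strerror(code)))
    return results

if __name__ == "__main__":
    line = ("LustreError: 14207:0:(file.c:995:ll_glimpse_size()) "
            "obd_enqueue returned rc -5, returning -EIO")
    for rc, name, desc in decode_rcs(line):
        print(rc, name, desc)
```

On a Linux host the example line decodes to `-5 EIO Input/output error`. The `rc 0/0` field at the end of the ptlrpc request dump is deliberately not matched, since the pattern only captures negative codes.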
Samuel Aparicio
2011-Mar-21 03:49 UTC
[Lustre-discuss] persistent client re-connect failure
Follow-up to this posting. I notice that on the client, lctl device_list reports the following:

  0 UP mgc MGC10.9.89.51@tcp 5a76b5b6-82bf-2053-8c17-e68ffe552edc 5
  1 UP lov lustre-clilov-ffff8100459a9c00 6775de4c-6c29-9316-a715-3472233477d1 4
  2 UP mdc lustre-MDT0000-mdc-ffff8100459a9c00 6775de4c-6c29-9316-a715-3472233477d1 5
  3 UP osc lustre-OST0000-osc-ffff8100459a9c00 6775de4c-6c29-9316-a715-3472233477d1 5
  4 UP osc lustre-OST0001-osc-ffff8100459a9c00 6775de4c-6c29-9316-a715-3472233477d1 5
  5 UP osc lustre-OST0002-osc-ffff8100459a9c00 6775de4c-6c29-9316-a715-3472233477d1 5
  6 UP osc lustre-OST0003-osc-ffff8100459a9c00 6775de4c-6c29-9316-a715-3472233477d1 4
  7 UP osc lustre-OST0004-osc-ffff8100459a9c00 6775de4c-6c29-9316-a715-3472233477d1 5
  8 UP osc lustre-OST0005-osc-ffff8100459a9c00 6775de4c-6c29-9316-a715-3472233477d1 5
  9 UP osc lustre-OST0006-osc-ffff8100459a9c00 6775de4c-6c29-9316-a715-3472233477d1 5
 10 UP lov lustre-clilov-ffff810c92f2b800 0ecd69f5-6793-fcb1-0e05-8851c99e5dc5 4
 11 UP mdc lustre-MDT0000-mdc-ffff810c92f2b800 0ecd69f5-6793-fcb1-0e05-8851c99e5dc5 5
 12 UP osc lustre-OST0000-osc-ffff810c92f2b800 0ecd69f5-6793-fcb1-0e05-8851c99e5dc5 5
 13 UP osc lustre-OST0001-osc-ffff810c92f2b800 0ecd69f5-6793-fcb1-0e05-8851c99e5dc5 5
 14 UP osc lustre-OST0002-osc-ffff810c92f2b800 0ecd69f5-6793-fcb1-0e05-8851c99e5dc5 5
 15 UP osc lustre-OST0003-osc-ffff810c92f2b800 0ecd69f5-6793-fcb1-0e05-8851c99e5dc5 4
 16 UP osc lustre-OST0004-osc-ffff810c92f2b800 0ecd69f5-6793-fcb1-0e05-8851c99e5dc5 5
 17 UP osc lustre-OST0005-osc-ffff810c92f2b800 0ecd69f5-6793-fcb1-0e05-8851c99e5dc5 5
 18 UP osc lustre-OST0006-osc-ffff810c92f2b800 0ecd69f5-6793-fcb1-0e05-8851c99e5dc5 5
 19 UP lov lustre-clilov-ffff81047a45c000 6a3d5815-4851-31b0-9400-c8892e11dae4 4
 20 UP mdc lustre-MDT0000-mdc-ffff81047a45c000 6a3d5815-4851-31b0-9400-c8892e11dae4 5
 21 UP osc lustre-OST0000-osc-ffff81047a45c000 6a3d5815-4851-31b0-9400-c8892e11dae4 5
 22 UP osc lustre-OST0001-osc-ffff81047a45c000 6a3d5815-4851-31b0-9400-c8892e11dae4 5
 23 UP osc lustre-OST0002-osc-ffff81047a45c000 6a3d5815-4851-31b0-9400-c8892e11dae4 5
 24 UP osc lustre-OST0003-osc-ffff81047a45c000 6a3d5815-4851-31b0-9400-c8892e11dae4 4
 25 UP osc lustre-OST0004-osc-ffff81047a45c000 6a3d5815-4851-31b0-9400-c8892e11dae4 5
 26 UP osc lustre-OST0005-osc-ffff81047a45c000 6a3d5815-4851-31b0-9400-c8892e11dae4 5
 27 UP osc lustre-OST0006-osc-ffff81047a45c000 6a3d5815-4851-31b0-9400-c8892e11dae4 5

However, OST3 is non-existent; it was deactivated on the MDS. Why would the clients think it exists?
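Each `lctl device_list` line above appears to consist of a device index, a status, a device type, a device name, a client UUID, and a reference count. Under that assumed layout, a minimal sketch that groups the OSC devices by the per-mount suffix of the device name (the `ffff...` part) shows how many client instances exist and which OSTs each one knows about. The function name and the grouping rule are illustrative assumptions, not part of `lctl`:

```python
import re
from collections import defaultdict

# Assumed line layout: "<index> <status> <type> <name> <uuid> <refcount>".
DEVICE_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s*$"
)

def osc_targets_by_instance(dl_text):
    """Map each client instance suffix to a list of (ost_name, refcount)."""
    instances = defaultdict(list)
    for line in dl_text.splitlines():
        m = DEVICE_LINE.match(line)
        if not m:
            continue  # not a device line (or not in the assumed layout)
        _index, _status, dev_type, name, _uuid, refcount = m.groups()
        if dev_type != "osc":
            continue  # skip mgc, lov and mdc devices
        # OSC names look like "lustre-OST0003-osc-ffff8100459a9c00".
        ost, _, instance = name.partition("-osc-")
        instances[instance].append((ost, int(refcount)))
    return dict(instances)
```

Run against the full listing above, this would report three instances (`ffff8100459a9c00`, `ffff810c92f2b800` and `ffff81047a45c000`), each holding OSCs for OST0000 through OST0006. It also makes it easy to spot that every OST0003 OSC carries a reference count of 4 while the other OSCs carry 5.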
If you *only* deactivate it on the MDS, then you can still see the OST on the client; you just can't write to it anymore.
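To check from the client side which OSC imports are currently active, `lctl get_param osc.*.active` prints one `name=value` line per OSC (the same `active` parameter that appears in the `set parameter active=0` message in the original log). A minimal sketch that runs it and reports the inactive ones, assuming `lctl` is on `PATH` and its output uses that `param=value` form; `parse_active` and `inactive_oscs` are hypothetical helper names:

```python
import subprocess

def parse_active(output):
    """Parse 'osc.<name>.active=<0|1>' lines into {name: is_active}."""
    states = {}
    for line in output.splitlines():
        param, sep, value = line.strip().partition("=")
        # Skip blank lines or anything that is not an "...active=N" pair.
        if not sep or not param.endswith(".active"):
            continue
        name = param[: -len(".active")]
        if name.startswith("osc."):
            name = name[len("osc."):]
        states[name] = value.strip() == "1"
    return states

def inactive_oscs():
    """Run lctl on this client and return the sorted names of inactive OSCs."""
    # Assumes lctl is on PATH and the caller may read Lustre parameters.
    result = subprocess.run(
        ["lctl", "get_param", "osc.*.active"],
        capture_output=True, text=True, check=True,
    )
    states = parse_active(result.stdout)
    return sorted(name for name, active in states.items() if not active)
```

On the client in this thread one would expect the `lustre-OST0003-osc-*` entries to read `0`, matching the `set parameter active=0` message; that is an inference from the log, not captured output.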
Samuel Aparicio
2011-Mar-21 17:37 UTC
[Lustre-discuss] persistent client re-connect failure
It was permanently inactivated on the MDS ... strange that it should show up at all. Is the OST list persistent through the history of additions/deletions?
Samuel Aparicio
2011-Mar-21 17:39 UTC
[Lustre-discuss] persistent client re-connect failure
Follow-up: rebooting the client fixed this issue. I could not remove the kernel modules (lustre_rmmod) and restart lnet even though the filesystem was unmounted, presumably because there was still some transaction trying to be played out. Is there a better way to do this?

sam a.
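When `lustre_rmmod` refuses to unload the modules, the kernel's per-module use counts show what is still pinned. A minimal sketch that reads `/proc/modules` (standard Linux columns: name, size, use count, dependents, state, address) and reports Lustre-stack modules that are still in use; the module-name set below is an assumption about a 1.8 client build and may need adjusting:

```python
import os

# Hypothetical subset of Lustre 1.8 client-side module names; adjust to the build.
LUSTRE_MODULES = {
    "lustre", "lov", "osc", "mdc", "mgc", "ptlrpc",
    "obdclass", "lvfs", "ksocklnd", "lnet", "libcfs",
}

def modules_in_use(proc_modules_text, names=LUSTRE_MODULES):
    """Return {module: (use_count, dependents)} for listed modules still in use."""
    busy = {}
    for line in proc_modules_text.splitlines():
        # /proc/modules columns: name, size, use count, dependents, state, address.
        fields = line.split()
        if len(fields) < 4 or fields[0] not in names or not fields[2].isdigit():
            continue
        use_count = int(fields[2])
        # Dependents are comma-separated with a trailing comma, or "-" if none.
        dependents = [d for d in fields[3].split(",") if d and d != "-"]
        if use_count > 0:
            busy[fields[0]] = (use_count, dependents)
    return busy

if __name__ == "__main__":
    if os.path.exists("/proc/modules"):
        with open("/proc/modules") as f:
            for name, (count, deps) in sorted(modules_in_use(f.read()).items()):
                holders = ", ".join(deps) if deps else "no module dependents"
                print(f"{name}: use count {count} ({holders})")
```

A module with a non-zero use count but no module dependents is being held by something other than another module, for example a mount or device that was not fully torn down. That is a plausible, but unconfirmed, explanation for the unload failure described above.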
Andreas Dilger
2011-Mar-21 17:59 UTC
[Lustre-discuss] persistent client re-connect failure
On 2011-03-21, at 6:37 PM, Samuel Aparicio wrote:
> it was permanently inactivated on the mds ... strange that it should show up at all. the OST list is persistent through the history of additions/deletions ...?

Yes, the "removed" OST will continue to be listed (for good or bad). It is really just "permanent deactivation".

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.