Somsak Sriprayoonsakul
2006-Nov-07 05:29 UTC
[Lustre-discuss] Can't mount lustre on some nodes
Dear List,

I'm trying to set up a Lustre 1.6b5 cluster where every node except the frontend serves an OST, the frontend serves the MGS+MDT, and every node (including the frontend) mounts and uses Lustre. Somehow there's a weird problem where some nodes can't mount Lustre but some nodes can.

My configuration:

OS: Rocks 4.2.1 Cluster (CentOS 4.4) using the stock Lustre 2.6.9-42.EL_lustre.1.5.95smp kernel. The frontend has 2 IPs (real + private) and every compute node uses a private IP.
Lustre: 1.6b5.

Here are the logs from the frontend (MGS+MDT) and a failed client node.

Failed client node:

Lustre: mount data:
Lustre:  profile: lustre-client
Lustre:  device:  10.1.1.1@tcp:/lustre
Lustre:  flags:   2
LustreError: 22040:0:(client.c:579:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -107
LustreError: 22040:0:(client.c:579:ptlrpc_check_status()) Skipped 3 previous similar messages
LustreError: 22040:0:(mgc_request.c:964:mgc_process_log()) Can't get cfg lock: -107
LustreError: 3099:0:(mgc_request.c:493:mgc_blocking_ast()) original grant failed, won't requeue
LustreError: 22040:0:(mgc_request.c:1014:mgc_process_log()) MGC10.1.1.1@tcp: the configuration 'lustre-client' could not be read (-107) from the MGS.
LustreError: MGC10.1.1.1@tcp: The configuration 'lustre-client' could not be read from the MGS (-107). This may be the result of communication errors between this node and the MGS, or the MGS may not be running.
Lustre: 0 UP mgc MGC10.1.1.1@tcp f19e61f7-623f-55a2-6332-ea987600d10d 5
Lustre: 1 UP ost OSS OSS_uuid 3
Lustre: 2 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 9
LustreError: 22040:0:(llite_lib.c:909:ll_fill_super()) Unable to process log: -107
Lustre: client 0000010118688000 umount complete
LustreError: 22040:0:(obd_mount.c:1857:lustre_fill_super()) Unable to mount (-107)

Frontend:

LustreError: 10490:0:(mgs_handler.c:468:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
LustreError: 10490:0:(mgs_handler.c:468:mgs_handle()) Skipped 1 previous similar message
LustreError: 10490:0:(ldlm_lib.c:1317:target_send_reply_msg()) @@@ processing error (-107)
LustreError: 10490:0:(ldlm_lib.c:1317:target_send_reply_msg()) Skipped 3 previous similar messages

I think I strictly followed the guide at https://mail.clusterfs.com/wikis/lustre/MountConf. I suppose the problem occurred because of IP confusion on the frontend, but some compute nodes do successfully mount the Lustre file system from the frontend.

Regards,

--
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
somsak_sr@thaigrid.or.th
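For context, the mount the failing client was attempting can be reconstructed from the "mount data" lines above; the mount point below is a hypothetical example, while the MGS NID and fsname come from the logs. Errno 107 is ENOTCONN ("Transport endpoint is not connected"), i.e. the client's MGC lost its connection to the MGS:

```shell
# Sketch, not a verbatim repro: 10.1.1.1@tcp:/lustre is taken from the
# logs above; /mnt/lustre is an assumed mount point for illustration.
mount -t lustre 10.1.1.1@tcp:/lustre /mnt/lustre

# The repeated -107 is -ENOTCONN ("Transport endpoint is not connected"):
# the client's MGC could not hold a connection to the MGS at 10.1.1.1@tcp,
# so the 'lustre-client' configuration log could not be fetched.
```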
Nathaniel Rutman

Use "lctl list_nids" and "lctl ping <remote_nid>" on the clients and servers to help see where the problem is.

Somsak Sriprayoonsakul wrote:
> [...]
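Concretely, that check might look like the following; the MGS NID is from the logs above, while the client NID is an assumed example on the same 10.1.1.x private network:

```shell
# On each node: show the NIDs this node's LNET stack is actually using.
# If the frontend lists its public IP instead of 10.1.1.1@tcp, clients
# on the private network cannot reach the MGS.
lctl list_nids

# From a failing client: ping the MGS NID used in the mount command.
lctl ping 10.1.1.1@tcp

# From the frontend: ping a failing client's NID (hypothetical example).
lctl ping 10.1.1.5@tcp
```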
Somsak Sriprayoonsakul
2006-Nov-09 01:49 UTC
[Lustre-discuss] Can't mount lustre on some nodes
The problem was solved in a very weird way. I found that when I unmounted the OST temporarily and remounted it, the client mounts just came back to working again. Now every node can see the file system without problems. lctl ping seems to work OK on every node (I didn't test every possibility, but the few tests I ran all succeeded).

Nathaniel Rutman wrote:
> [...]
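For anyone hitting the same symptom, the workaround described above amounts to something like this on the affected OSS node; the device path and mount points are assumptions for illustration:

```shell
# On the OSS whose clients fail to mount (paths are hypothetical):
umount /mnt/ost0                       # temporarily stop the OST
mount -t lustre /dev/sdb1 /mnt/ost0    # remount it

# Afterwards, the previously failing clients could mount again:
mount -t lustre 10.1.1.1@tcp:/lustre /mnt/lustre
```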