Christopher J. Walker
2010-Jul-08 17:32 UTC
[Lustre-discuss] Clients losing connection to an OSS.
With 1.8.3 clients and 1.8.3 OSSs, a couple of my nodes seem to have lost connection to an OSS. If I do lfs df, I get the following:

lustre_0-OST0028_UUID: Resource temporarily unavailable
lustre_0-OST0029_UUID: Resource temporarily unavailable
lustre_0-OST002a_UUID: Resource temporarily unavailable
lustre_0-OST002b_UUID: Resource temporarily unavailable
lustre_0-OST002c_UUID 6486115712 3882764932 2603348732 59% /mnt/lustre_0[OST:44]
lustre_0-OST002d_UUID 6486115712 3797895540 2688209196 58% /mnt/lustre_0[OST:45]
lustre_0-OST002e_UUID 6486115712 3717364684 2768740788 57% /mnt/lustre_0[OST:46]
lustre_0-OST002f_UUID 6486115712 3535928996 2950180572 54% /mnt/lustre_0[OST:47]

This has happened on several machines. Rebooting them seems to cure it.

There are a large number of error messages in the logs, e.g.:

Jul 7 18:22:14 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340150774596107 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID 10.1.4.121@tcp 21s ago has timed out (21s prior to deadline).
Jul 7 18:22:14 cn458 kernel: req@ffff8100841ed000 x1340150774596107/t0 o8->lustre_0-OST0028_UUID@10.1.4.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1278523334 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 7 18:22:14 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 52 previous similar messages
Jul 7 18:23:06 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) lustre_0-OST0004-osc-ffff81021f55a400: tried all connections, increasing latency to 19s
Jul 7 18:23:06 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) Skipped 58 previous similar messages
Jul 7 18:26:48 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340150774596722 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID 10.1.4.121@tcp 30s ago has timed out (30s prior to deadline).
Jul 7 18:26:48 cn458 kernel: req@ffff8101e00d1800 x1340150774596722/t0 o8->lustre_0-OST0028_UUID@10.1.4.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1278523608 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 7 18:26:48 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 95 previous similar messages
Jul 7 18:28:22 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) lustre_0-OST0028-osc-ffff81021f55a400: tried all connections, increasing latency to 25s
Jul 7 18:28:22 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) Skipped 84 previous similar messages
Jul 7 18:35:35 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340150774597865 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID 10.1.4.121@tcp 30s ago has timed out (30s prior to deadline).
Jul 7 18:35:35 cn458 kernel: req@ffff8101d66d6800 x1340150774597865/t0 o8->lustre_0-OST0028_UUID@10.1.4.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1278524135 ref 2 fl Rpc:N/0/0 rc 0/0

Is there a known problem? What information would help debug this?

Chris

PS: Clients are on bonded 1GigE, servers on 10GigE (if that makes a difference).
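
In case it helps anyone comparing affected clients, here is a rough Python sketch that tallies the ptlrpc_expire_one_request() timeouts above by target and server NID, to show whether they all point at the same OSS. The function name tally_timeouts and the regular expression are my own, written against the message format quoted above only; other Lustre versions may word the message differently. It accepts the NID written either as "10.1.4.121@tcp" or as "10.1.4.121 at tcp".

```python
import re
from collections import Counter

# Matches the "Request ... has timed out" message quoted in this post.
# Assumption: the pattern is derived from the sample lines only.
TIMEOUT_RE = re.compile(
    r"Request x\d+ sent from (?P<target>\S+?)-osc-\S+ "
    r"to NID (?P<nid>\S+?)(?: at |@)(?P<net>\w+) "
    r"(?P<age>\d+)s ago has timed out"
)


def tally_timeouts(lines):
    """Count timed-out requests per (target, NID) pair.

    `lines` is any iterable of syslog lines; non-matching lines are ignored.
    "Skipped N previous similar messages" lines are not added to the count.
    """
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            nid = f"{match.group('nid')}@{match.group('net')}"
            counts[(match.group("target"), nid)] += 1
    return counts
```

To use it, pass an open log file, for example tally_timeouts(open(path)) where path is wherever the client's kernel messages are logged (that location depends on the syslog setup), and print counts.most_common().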