Hello all,
We have a serious problem with Lustre. For the past few days we have
had lockups on the client side. Not all clients are affected.
We are running kernel 2.6.16-54-0.2.5_lustre.1.6.4.3smp.
Statahead has been disabled on these systems.
Some more information about the environment:
- The Lustre clients are all VMware virtual systems
- The Lustre farm systems are all VMware virtual systems
The errors I see are the following:
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e5dca000
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e519e000
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e4e0a000
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e86b1bc0
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e79fe5c0
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e70a88c0
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e7081280
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e6d6d5c0
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816920, 100s ago) req@ffff8100e7e2ba00 x17940/t0
o4->lustre-OST0005_UUID@172.16.0.29@tcp:28 lens 384/352 ref 2 fl Rpc:/
0/0 rc 0/-22
Lustre: lustre-OST0005-osc-ffff8100e8551800: Connection to service
lustre-OST0005 via nid 172.16.0.29@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0005-osc-ffff8100e8551800: Connection restored to
service lustre-OST0005 using nid 172.16.0.29@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816924, 100s ago) req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/0/0 rc 0/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
2 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816953, 100s ago) req@ffff8100e7e2d800 x20560/t0
o4->lustre-OST0006_UUID@172.16.0.30@tcp:28 lens 384/352 ref 2 fl Rpc:/
0/0 rc 0/-22
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection to service
lustre-OST0006 via nid 172.16.0.30@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection restored to
service lustre-OST0006 using nid 172.16.0.30@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817024, 100s ago) req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817053, 100s ago) req@ffff8100e7e2d800 x20724/t0
o4->lustre-OST0006_UUID@172.16.0.30@tcp:28 lens 384/352 ref 2 fl Rpc:/
2/0 rc -11/-22
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection to service
lustre-OST0006 via nid 172.16.0.30@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection restored to
service lustre-OST0006 using nid 172.16.0.30@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817124, 100s ago) req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817153, 100s ago) req@ffff8100e7e2d800 x20767/t0
o4->lustre-OST0006_UUID@172.16.0.30@tcp:28 lens 384/352 ref 2 fl Rpc:/
2/0 rc -11/-22
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection to service
lustre-OST0006 via nid 172.16.0.30@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection restored to
service lustre-OST0006 using nid 172.16.0.30@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817224, 100s ago) req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817324, 100s ago) req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
Lustre: Skipped 1 previous similar message
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817424, 100s ago) req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
Lustre: Skipped 1 previous similar message
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817553, 100s ago) req@ffff8100e7e2d800 x20952/t0
o4->lustre-OST0006_UUID@172.16.0.30@tcp:28 lens 384/352 ref 2 fl Rpc:/
2/0 rc -11/-22
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
2 previous similar messages
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection to service
lustre-OST0006 via nid 172.16.0.30@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 2 previous similar messages
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection restored to
service lustre-OST0006 using nid 172.16.0.30@tcp.
Lustre: Skipped 2 previous similar messages
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100efba6800
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817824, 99s ago) req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
4 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 4 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
Lustre: Skipped 4 previous similar messages
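
In case it helps anyone reading the log above, here is a rough Python
sketch that counts the ptlrpc_expire_one_request() timeouts per target
and opcode and converts the "sent at" epoch values to UTC. The function
name summarize_timeouts and the SAMPLE excerpt are only illustrative,
not part of any Lustre tool, and the pattern assumes the wrapped message
layout shown in this post (it accepts "req@" as well as the "req at"
form some list archives produce).

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches one timeout record, even when the mailer wrapped it across lines.
TIMEOUT_RE = re.compile(
    r"sent at (\d+), (\d+)s ago\)\s+req\s*(?:at|@)\s*\S+\s+"
    r"x(\d+)/t\d+\s+o(\d+)->(\S+?)_UUID"
)

def summarize_timeouts(log_text):
    """Return (counts, first_seen), both keyed by (target, opcode)."""
    counts = Counter()
    first_seen = {}
    for sent, _age, _xid, opcode, target in TIMEOUT_RE.findall(log_text):
        key = (target, int(opcode))
        counts[key] += 1
        # Keep the send time of the first timeout seen for this key.
        first_seen.setdefault(
            key, datetime.fromtimestamp(int(sent), tz=timezone.utc)
        )
    return counts, first_seen

# Short excerpt from the log in this post, used only as sample input.
SAMPLE = """\
timeout (sent at 1225816920, 100s ago) req@ffff8100e7e2ba00 x17940/t0
o4->lustre-OST0005_UUID@172.16.0.29@tcp:28 lens 384/352 ref 2 fl Rpc:/
0/0 rc 0/-22
timeout (sent at 1225816924, 100s ago) req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/0/0 rc 0/-22
timeout (sent at 1225817024, 100s ago) req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
"""

counts, first_seen = summarize_timeouts(SAMPLE)
for (target, opcode), n in counts.most_common():
    when = first_seen[(target, opcode)].isoformat()
    print(f"{target} opcode {opcode}: {n} timeout(s), first sent {when}")
```

Messages that the kernel collapsed into "Skipped N previous similar
messages" are not counted by this sketch, so the totals are a lower
bound.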
Could somebody help me out?
Thanks in advance.
Kurt