Hi,

I have run some tests with Lustre 1.6.3 (kernel 2.6.18-8.1.14.el5_lustre.1.6.3smp) and came across the following problem: unzipping a large zip archive on a Lustre filesystem hangs (virtually forever) after about 30000 files have been extracted. strace shows that the chmod call on the client does not return. The problem is reproducible.

The messages file on the client says (several times):

Nov 14 16:54:19 linuxwcc07 kernel: LustreError: 11872:0:(client.c:969:ptlrpc_expire_one_request()) @@@ timeout (sent at 1195055558, 100s ago) req@ffff810201c61a00 x491921/t0 o36->lustre-MDT0000_UUID@137.226.71.155@tcp:12 lens 5864/296 ref 1 fl Rpc:/0/0 rc 0/-22
Nov 14 16:54:19 linuxwcc07 kernel: LustreError: 11872:0:(client.c:969:ptlrpc_expire_one_request()) @@@ timeout (sent at 1195055558, 100s ago) req@ffff810201c61a00 x491921/t0 o36->lustre-MDT0000_UUID@137.226.71.155@tcp:12 lens 5864/296 ref 1 fl Rpc:/0/0 rc 0/-22
Nov 14 16:54:19 linuxwcc07 kernel: Lustre: lustre-MDT0000-mdc-ffff81021adedc00: Connection to service lustre-MDT0000 via nid 137.226.71.155@tcp was lost; in progress operations using this service will wait for recovery to complete.
Nov 14 16:54:19 linuxwcc07 kernel: Lustre: lustre-MDT0000-mdc-ffff81021adedc00: Connection to service lustre-MDT0000 via nid 137.226.71.155@tcp was lost; in progress operations using this service will wait for recovery to complete.
Nov 14 16:54:19 linuxwcc07 kernel: Lustre: lustre-MDT0000-mdc-ffff81021adedc00: Connection restored to service lustre-MDT0000 using nid 137.226.71.155@tcp.

The corresponding messages on the MDS:

Nov 14 16:52:38 linuxwcc05 kernel: LustreError: 7483:0:(lib-move.c:95:lnet_try_match_md()) Matching packet from 12345-137.226.71.157@tcp, match 491921 length 5864 too big: 7416 left, 5120 allowed
Nov 14 16:52:38 linuxwcc05 kernel: LustreError: 7483:0:(lib-move.c:95:lnet_try_match_md()) Matching packet from 12345-137.226.71.157@tcp, match 491921 length 5864 too big: 7416 left, 5120 allowed
Nov 14 16:54:19 linuxwcc05 kernel: Lustre: 7606:0:(ldlm_lib.c:514:target_handle_reconnect()) lustre-MDT0000: ec82c01d-f203-81b7-ed36-e0f0cf3b3f32 reconnecting
Nov 14 16:54:19 linuxwcc05 kernel: Lustre: 7606:0:(ldlm_lib.c:514:target_handle_reconnect()) lustre-MDT0000: ec82c01d-f203-81b7-ed36-e0f0cf3b3f32 reconnecting

Is this a known issue?

Regards,
Hans Schnitzer

--
Hans-Juergen Schnitzer
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum
Seffenter Weg 23, 52074 Aachen (Germany)
Tel.: +49 (0)241/80-28719 - Fax: +49 (0)241/80-628719
schnitzer at rz.rwth-aachen.de
http://www.rz.rwth-aachen.de
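
The client and MDS messages in the report can be tied together by their numbers: the client's request ID (x491921) equals the MDS "match" value, and the client's "lens 5864/296" carries the same 5864 that the MDS rejects as larger than the 5120 it allows. The Python sketch below only illustrates that cross-check. The function names and regular expressions are made up for this example and are not part of any Lustre tool, and reading the first "lens" number as the request length is an assumption based on how the two messages line up.

```python
import re

# Log excerpts taken from the report (NIDs written in the addr@tcp form).
CLIENT_LINE = (
    "LustreError: 11872:0:(client.c:969:ptlrpc_expire_one_request()) @@@ timeout "
    "(sent at 1195055558, 100s ago) req@ffff810201c61a00 x491921/t0 "
    "o36->lustre-MDT0000_UUID@137.226.71.155@tcp:12 lens 5864/296 ref 1 fl Rpc:/0/0 rc 0/-22"
)
MDS_LINE = (
    "LustreError: 7483:0:(lib-move.c:95:lnet_try_match_md()) Matching packet from "
    "12345-137.226.71.157@tcp, match 491921 length 5864 too big: 7416 left, 5120 allowed"
)

# Hypothetical patterns written for this example only.
CLIENT_RE = re.compile(
    r"x(?P<xid>\d+)/t\d+ o(?P<opc>\d+)->.*? lens (?P<reqlen>\d+)/(?P<replen>\d+)"
)
MDS_RE = re.compile(
    r"match (?P<match>\d+) length (?P<length>\d+) too big: "
    r"(?P<left>\d+) left, (?P<allowed>\d+) allowed"
)


def parse_fields(pattern, line):
    """Return the named numeric fields of a log line as ints, or None if no match."""
    m = pattern.search(line)
    if m is None:
        return None
    return {key: int(value) for key, value in m.groupdict().items()}


def correlate(client_line, mds_line):
    """Check whether a client timeout and an MDS 'too big' message describe one request."""
    client = parse_fields(CLIENT_RE, client_line)
    mds = parse_fields(MDS_RE, mds_line)
    if client is None or mds is None:
        return None
    return {
        "same_request": client["xid"] == mds["match"],
        "same_length": client["reqlen"] == mds["length"],
        # How many bytes the incoming request exceeds the allowed size by.
        "overflow": mds["length"] - mds["allowed"],
    }


if __name__ == "__main__":
    print(correlate(CLIENT_LINE, MDS_LINE))
```

If the assumption holds, the picture is that the MDS drops a request that is too large for its buffer, the client never gets a reply, and the request times out 100s later before the reconnect.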

Hans,

On the surface this sounds a lot like the following bug, which we have Sun looking into. If you have a good 1.6.3 reproducer, could you please attach it to the bug? We've been chasing something like this for a while and it has been tricky to reproduce. I'll certainly give your test case a spin and look into this.

https://bugzilla.lustre.org/show_bug.cgi?id=11332

Thanks,
Brian
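
Since a reproducer was asked for, here is a rough Python sketch of what a self-contained test case might look like: it builds a zip archive with many small files, extracts them one by one, and calls chmod on each extracted file, which is the call strace showed hanging. Everything specific here is an assumption: the file names, directory layout, payload, and mode are invented, the contents of the original archive are unknown, and this script has not been verified to trigger the hang.

```python
import os
import sys
import time
import zipfile


def build_archive(zip_path, n_files, payload=b"x"):
    """Create a zip with n_files small files, 1000 per directory (hypothetical layout)."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for i in range(n_files):
            zf.writestr(f"dir{i // 1000:04d}/file{i:06d}.txt", payload)


def extract_and_chmod(zip_path, dest, mode=0o644, report_every=1000):
    """Extract every member into dest and chmod it; return (count, slowest chmod in seconds)."""
    count = 0
    slowest = 0.0
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            target = zf.extract(name, dest)
            start = time.monotonic()
            os.chmod(target, mode)  # the call that did not return in the report
            slowest = max(slowest, time.monotonic() - start)
            count += 1
            if report_every and count % report_every == 0:
                print(f"{count} files extracted, slowest chmod {slowest:.3f}s", flush=True)
    return count, slowest


if __name__ == "__main__":
    # Usage: python repro.py /path/on/lustre [n_files]
    dest_dir = sys.argv[1]
    n = int(sys.argv[2]) if len(sys.argv) > 2 else 40000
    archive = os.path.join(dest_dir, "repro.zip")
    build_archive(archive, n)
    print(extract_and_chmod(archive, os.path.join(dest_dir, "extracted")))
```

The default of 40000 files is chosen only because the hang was reported after about 30000 extracted files. The progress output shows how far extraction got if a chmod call stops returning.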