Anatoly Oreshkin
2007-Dec-12 15:52 UTC
[Lustre-discuss] reading file hangs on Lustre 1.6.4 node
Hello,

I am new to Lustre. I've installed Lustre 1.6.4 on Scientific Linux 4.4 with kernel 2.6.9-55.0.9.EL_lustre.1.6.4smp.

The MGS, MDS and OST servers are all installed on the head node. The MGS and MDS servers have their storage on different disks.

MGS server on /dev/sdb1:

  /usr/sbin/mkfs.lustre --fsname=vtrak1fs --mgs /dev/sdb1

MDS server on /dev/sdc1:

  /usr/sbin/mkfs.lustre --fsname=vtrak1fs --mdt --mgsnode=head_node@tcp0 /dev/sdc1

The OST storage is based on RAID5 and connected via SCSI directly to the head node.

OST1 server on /dev/sdg1:

  /usr/sbin/mkfs.lustre --fsname=vtrak1fs --ost --mgsnode=head_node@tcp0 /dev/sdg1

On the client node, Lustre is started with mount:

  mount -t lustre head_node@tcp0:/vtrak1fs /vtrak1

TCP networking is used for communication between the nodes. The file /etc/modprobe.conf contains the line:

  options lnet networks=tcp

The command /usr/sbin/lctl list_nids issued on the head node gives:

  85.142.10.197@tcp

As a test, I read all files stored on OST1 from the head node. All files were read successfully.

Then I started the same read test of all files on OST1 from the client node with address 192.168.1.2. The command /usr/sbin/lctl list_nids issued on the client node gives:

  192.168.1.2@tcp

In this case the read test reads a number of files and then hangs on some file.
The command dmesg issued on the client node gives these error messages:

  LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial receive from 12345-85.142.10.197@tcp, ip 85.142.10.197:988, with error
  LustreError: 5017:0:(events.c:134:client_bulk_callback()) event type 1, status -5, desc ca9d3c00
  LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1197447164, 150s ago) req@ca9d3200 x4566962/t0 o3->vtrak1fs-OST0000_UUID@85.142.10.197@tcp:28 lens 384/336 ref 2 fl Rpc:/0/0 rc 0/-22
  LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
  Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection to service vtrak1fs-OST0000 via nid 85.142.10.197@tcp was lost; in progress operations using this service will wait for recovery to complete.
  Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection restored to service vtrak1fs-OST0000 using nid 85.142.10.197@tcp.
  Lustre: Skipped 1 previous similar message
  hw tcp v4 csum failed
  hw tcp v4 csum failed
  ...

Dmesg issued on the head node gives these errors:

  LustreError: 15048:0:(ost_handler.c:821:ost_brw_read()) @@@ timeout on bulk PUT req@cc6af600 x4566962/t0 o3->629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID:-1 lens 384/336 ref 0 fl Interpret:/0/0 rc 0/0
  Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: ignoring bulk IO comm error with 629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID id 12345-192.168.1.2@tcp - client will retry
  Lustre: 14987:0:(ldlm_lib.c:519:target_handle_reconnect()) vtrak1fs-OST0000: 629198c9-085d-f95a-462f-b5e535904a3d reconnecting

On the Lustre client, data checksums are disabled by default:

  cat /proc/fs/lustre/llite/vtrak1fs-f7f53200/checksum_pages -> 0

What might be the reason(s)? Any hints? How can I trace the problem?

Thank you.
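
As an aside on reading the head node log: the identifier NET_0x20000c0a80102_UUID appears to be the client's network ID written as one hexadecimal number. The sketch below decodes it under an assumed layout (low 32 bits are the IPv4 address, the next 16 bits the network number, the 16 bits above that the network type, with type 2 meaning tcp). That layout and the names decode_nid_uuid and LND_NAMES are illustrative assumptions, not taken from the thread; the result does agree with the client address the same log line reports.

```python
import ipaddress
import re

# Assumption: network type 2 is the socket (tcp) driver. Other types are
# left unnamed because the thread gives no information about them.
LND_NAMES = {2: "tcp"}


def decode_nid_uuid(uuid: str) -> str:
    """Decode a string like 'NET_0x20000c0a80102_UUID' into 'address@net'."""
    match = re.fullmatch(r"NET_0x([0-9a-fA-F]+)_UUID", uuid)
    if match is None:
        raise ValueError(f"unrecognised NID UUID: {uuid!r}")
    nid = int(match.group(1), 16)

    addr = nid & 0xFFFFFFFF      # assumed: low 32 bits hold the IPv4 address
    net = nid >> 32              # assumed: high bits describe the network
    lnd_type = net >> 16         # network type (2 is assumed to be tcp)
    net_num = net & 0xFFFF       # network number (0 for tcp0)

    name = LND_NAMES.get(lnd_type, f"lnd{lnd_type}")
    suffix = "" if net_num == 0 else str(net_num)
    return f"{ipaddress.IPv4Address(addr)}@{name}{suffix}"


print(decode_nid_uuid("NET_0x20000c0a80102_UUID"))  # 192.168.1.2@tcp
```

So the peer that the OST reports a bulk timeout against is the same client node, 192.168.1.2, on which the read test hangs.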
Brian J. Murrell
2007-Dec-12 19:37 UTC
[Lustre-discuss] reading file hangs on Lustre 1.6.4 node
On Wed, 2007-12-12 at 18:52 +0300, Anatoly Oreshkin wrote:
>
> In this case read test reads a number of files and then hangs on some file.
> The command dmesg issued on client node gives such error messages:
>
> LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial
> receive from 12345-85.142.10.197@tcp, ip 85.142.10.197:988, with error

I'm not really sure of the origin or meaning of this message but it seems pretty clear. This looks like a networking issue.

...
> hw tcp v4 csum failed
> hw tcp v4 csum failed

And this makes it look even more like a networking issue.

> Dmesg issued on head node gives errors:
>
> LustreError: 15048:0:(ost_handler.c:821:ost_brw_read()) @@@ timeout on bulk PUT
> req@cc6af600 x4566962/t0
> o3->629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID:-1 lens
> 384/336 ref 0 fl Interpret:/0/0 rc 0/0
> Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: ignoring
> bulk IO comm error with
> 629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID id
> 12345-192.168.1.2@tcp - client will retry
> Lustre: 14987:0:(ldlm_lib.c:519:target_handle_reconnect()) vtrak1fs-OST0000:
> 629198c9-085d-f95a-462f-b5e535904a3d reconnecting

All more indications of networking issues. I think you need to start debugging your network.

b.
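
One possible first step in that debugging is to count how often each network-related symptom shows up in dmesg output on each node. The following is a minimal sketch only: the pattern labels, the PATTERNS table and the summarize function are hypothetical names chosen for illustration, and the sample lines are shortened excerpts of the messages quoted in this thread.

```python
import re
from collections import Counter

# Hypothetical labels mapped to substrings of the messages seen in this thread.
PATTERNS = {
    "nic_checksum_failure": re.compile(r"hw tcp v4 csum failed"),
    "partial_receive": re.compile(r"ksocknal_destroy_conn"),
    "bulk_timeout": re.compile(r"timeout on bulk"),
    "rpc_timeout": re.compile(r"ptlrpc_expire_one_request\(\)\) @@@ timeout"),
    "connection_lost": re.compile(r"Connection to service \S+ via nid .* was lost"),
}


def summarize(lines):
    """Count how many lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Shortened excerpts of the client-side messages quoted above.
SAMPLE = [
    "LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial receive",
    "LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1197447164, 150s ago)",
    "hw tcp v4 csum failed",
    "hw tcp v4 csum failed",
]

for label, count in summarize(SAMPLE).most_common():
    print(f"{label}: {count}")
```

To use it on a real node, the output of dmesg could be saved to a file and its lines passed to summarize in place of SAMPLE. A count of checksum failures that keeps growing during the read test would be consistent with the networking diagnosis above, though it does not by itself identify the faulty component.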