Dear all,

I have an odd problem today with my Lustre 1.8.0 setup. All of the OSSes and the MDS appear fine, but one client has a problem: when I create a file on OST5 (one of my OSTs) and then dd or echo something into it, the process hangs and never completes. For example:

client1:/home # lfs setstripe -o 5 test.txt
client1:/home # lfs getstripe test.txt
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
8: lustre-OST0008_UUID ACTIVE
9: lustre-OST0009_UUID ACTIVE
10: lustre-OST000a_UUID ACTIVE
11: lustre-OST000b_UUID ACTIVE
12: lustre-OST000c_UUID ACTIVE
13: lustre-OST000d_UUID ACTIVE
14: lustre-OST000e_UUID ACTIVE
15: lustre-OST000f_UUID ACTIVE
16: lustre-OST0010_UUID ACTIVE
test.txt
        obdidx           objid           objid            group
             5       158029029       0x96b54e5                0
client1:/home # dd if=/dev/zero of=test.txt bs=1M count=100

The dd process hangs and never returns. If I edit the file and save it, its object moves to another OST, no longer OST5. For example:

client1:/home # dd if=/dev/zero of=test.txt bs=1M count=100
(interrupted with Ctrl-C)
1+0 records in
0+0 records out
0 bytes (0 B) copied, 173.488 seconds, 0.0 kB/s
client1:/home # vi test.txt    # add something and save
client1:/home # lfs getstripe test.txt
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
8: lustre-OST0008_UUID ACTIVE
9: lustre-OST0009_UUID ACTIVE
10: lustre-OST000a_UUID ACTIVE
11: lustre-OST000b_UUID ACTIVE
12: lustre-OST000c_UUID ACTIVE
13: lustre-OST000d_UUID ACTIVE
14: lustre-OST000e_UUID ACTIVE
15: lustre-OST000f_UUID ACTIVE
16: lustre-OST0010_UUID ACTIVE
test.txt
        obdidx           objid           objid            group
             6       159122026       0x97c026a                0

Both the client and the OSS seem fine, though, and the other clients and OSSes do not have this problem.

client1:/home # lfs check servers
lustre-MDT0000-mdc-ffff810438d12c00 active.
lustre-OST000a-osc-ffff810438d12c00 active.
lustre-OST000f-osc-ffff810438d12c00 active.
lustre-OST000c-osc-ffff810438d12c00 active.
lustre-OST0006-osc-ffff810438d12c00 active.
lustre-OST000e-osc-ffff810438d12c00 active.
lustre-OST0009-osc-ffff810438d12c00 active.
lustre-OST0000-osc-ffff810438d12c00 active.
lustre-OST000d-osc-ffff810438d12c00 active.
lustre-OST0003-osc-ffff810438d12c00 active.
lustre-OST0002-osc-ffff810438d12c00 active.
lustre-OST0008-osc-ffff810438d12c00 active.
lustre-OST000b-osc-ffff810438d12c00 active.
lustre-OST0004-osc-ffff810438d12c00 active.
lustre-OST0007-osc-ffff810438d12c00 active.
lustre-OST0005-osc-ffff810438d12c00 active.
lustre-OST0010-osc-ffff810438d12c00 active.
lustre-OST0001-osc-ffff810438d12c00 active.

I have tried this many times, but the logs reported error messages only once.

On the client:

Dec 19 18:28:57 client1 kernel: LustreError: 11-0: an error occurred while communicating with 12.12.71.106@o2ib. The ost_punch operation failed with -107
Dec 19 18:28:57 client1 kernel: LustreError: Skipped 1 previous similar message
Dec 19 18:28:57 client1 kernel: Lustre: lustre-OST0005-osc-ffff810438d12c00: Connection to service lustre-OST0005 via nid 12.12.71.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Dec 19 18:28:57 client1 kernel: LustreError: 4570:0:(import.c:909:ptlrpc_connect_interpret()) lustre-OST0005_UUID went back in time (transno 189979771521 was previously committed, server now claims 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
Dec 19 18:28:57 client1 kernel: LustreError: 167-0: This client was evicted by lustre-OST0005; in progress operations using this service will fail.
Dec 19 18:28:57 client1 kernel: LustreError: 7128:0:(rw.c:192:ll_file_punch()) obd_truncate fails (-5) ino 41729130
Dec 19 18:28:57 client1 kernel: Lustre: lustre-OST0005-osc-ffff810438d12c00: Connection restored to service lustre-OST0005 using nid 12.12.71.106@o2ib.

On the OSS:

Dec 19 18:27:52 os6 kernel: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 12.12.12.32@o2ib ns: filter-lustre-OST0005_UUID lock: ffff810087d66200/0xae56b014db6d6d0a lrc: 3/0,0 mode: PR/PR res: 158015656/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0xe02336632642c5fc expref: 27 pid: 5333 timeout 7284896273
Dec 19 18:28:57 os6 kernel: LustreError: 5407:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-107) req@ffff8103dd91b400 x1343016412725286/t0 o10-><?>@<?>:0/0 lens 400/0 e 0 to 0 dl 1292754580 ref 1 fl Interpret:/0/0 rc -107/0

The MDS logged nothing related to this, and I don't know whether these messages are connected to the problem or not.

My Lustre version is 1.8.0 on SLES 10 SP2. By the way, the MDS crashed yesterday with bug #19528, so Lustre on the MDS is now patched with attachments 23574, 23648, and 23751 from bz #19528; the other nodes have no patches.

Thanks all.
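
P.S. If it helps, I can also collect basic connectivity and device-state information from the problem client. This is only a sketch of what I plan to run (lctl ping and lctl dl, which I believe are available in 1.8; the NID is the one for os6 taken from the client log above), and I have not captured the output yet:

client1:/home # lctl ping 12.12.71.106@o2ib    # LNET-level ping to the OSS serving OST0005
client1:/home # lctl dl                        # list local Lustre devices and their status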