Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] LustreError: ptlrpc_expire_one_request
On Oct 14, 2004 17:10 +0200, Martin Vogt wrote:> I can reproduce an error when I run iozone in an endless loop.Could you refresh my memory about what Lustre+kernel version you are running.> Here is the error: > > On the OST 2 I get under network load: > > LustreError: 7785:0:(ldlm_lockd.c:454:ldlm_server_completion_ast()) ### > enqueue wait took 141480917us ns: filter-media-ost2_UUID lock: > f3c578e0/0x8fb46cbaf9b8835f lrc: 3/0,0 mode: PW/PW res: 588/0 rrc: 1 > type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: > 0 remote: 0x8190143b3c07432a expref: 50 > > [...more of the same error ..] > > LustreError: 0:0:(ldlm_lockd.c:192:waiting_locks_callback()) ### lock > callback timer expired: evicting client > 71c5f_media-lov_7d75d07474@NET_0xc0a80901_UUID nid 0:192.168.9.1 ns: > filter-media-ost2_UUID lock: f291cd60/0x8fb46cbaf9b88581 lrc: 1/0,0 > mode: PW/PW res: 924/0 rrc: 5 type: EXT [4295000064->6497665023] (req > 4295000064->4295032831) flags: 10020 remote: 0x423206365c0bab7a expref: 3 > > [...more of the same error ..] > > LustreError: 7651:0:(ldlm_lockd.c:808:ldlm_handle_cancel()) received > cancel for unknown lock cookie 0x8fb46cbaf9b8855e from client > df172_media-lov_10d13a06b4 nid 0:192.168.9.2What are the relative timestamps of these messages? Do locks that are granted in a delayed manner (first messages, use "lock:" value to identify) also appear in the "lock callback timer expired" messages and later the "cancel for unknown lock" messages? It appears that the OST isn''t responding to lock requests in a timely manner, and then when the lock is finally granted the server isn''t giving the client very much time to use the lock. The other issue is that the timeout needs to be set so that clients can reasonably flush their cache to the OSTs within the timeout, so: client cache * num_clients / min(client bw, ost disk/net bw) < obd_timeout / 3 How many clients are involved, how many OSTs, what is the disk/net bandwidth?> Then after some time I get on the client2: > > LustreError: 14375:0:(client.c:817:ptlrpc_expire_one_request()) @@@ > timeout (sent at 1097754922, 100s ago) req@c98a4800 x33746/t803827 > o4->media-ost2_UUID@NID_192.168.9.12_UUID:6 lens 304/248 ref 3 fl > ?phase?:R/4/0 rc 0/0This means that the client writes are not completing to the OST in a timely manner. Cheers, Andreas -- Andreas Dilger
Hello clusterfs, I can reproduce an error when I run iozone in an endless loop. Here is the error: On the OST 2 I get under network load: LustreError: 7785:0:(ldlm_lockd.c:454:ldlm_server_completion_ast()) ### enqueue wait took 141480917us ns: filter-media-ost2_UUID lock: f3c578e0/0x8fb46cbaf9b8835f lrc: 3/0,0 mode: PW/PW res: 588/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0 remote: 0x8190143b3c07432a expref: 50 [...more of the same error ..] LustreError: 0:0:(ldlm_lockd.c:192:waiting_locks_callback()) ### lock callback timer expired: evicting client 71c5f_media-lov_7d75d07474@NET_0xc0a80901_UUID nid 0:192.168.9.1 ns: filter-media-ost2_UUID lock: f291cd60/0x8fb46cbaf9b88581 lrc: 1/0,0 mode: PW/PW res: 924/0 rrc: 5 type: EXT [4295000064->6497665023] (req 4295000064->4295032831) flags: 10020 remote: 0x423206365c0bab7a expref: 3 [...more of the same error ..] LustreError: 7651:0:(ldlm_lockd.c:808:ldlm_handle_cancel()) received cancel for unknown lock cookie 0x8fb46cbaf9b8855e from client df172_media-lov_10d13a06b4 nid 0:192.168.9.2 [...more of the same error ..] Then after some time I get on the client2: LustreError: 14375:0:(client.c:817:ptlrpc_expire_one_request()) @@@ timeout (sent at 1097754922, 100s ago) req@c98a4800 x33746/t803827 o4->media-ost2_UUID@NID_192.168.9.12_UUID:6 lens 304/248 ref 3 fl ?phase?:R/4/0 rc 0/0 Then lustre needs some time, and when it calls the upcall script the OST2 is reintegrated and the application continues. Before that there are no error messages. Can anyone help me with this error message? regards, Martin