Hello all,
I have a problem with Lustre 1.8.0 on CentOS 5.0.
These are the logs on the client:
Oct 4 23:56:13 web5 kernel: Lustre: lustre-MDT0000-mdc-ffff81022e8e9400:
Connection to service lustre-MDT0000 via nid 192.168.1.78@tcp was lost; in
progress
Oct 4 23:56:13 web5 kernel: LustreError:
3260:0:(import.c:909:ptlrpc_connect_interpret()) lustre-MDT0000_UUID went
back in time (transno 12892928971 was pre
Oct 4 23:56:13 web5 kernel: LustreError: 167-0: This client was evicted by
lustre-MDT0000; in progress operations using this service will fail.
Oct 4 23:56:13 web5 kernel: LustreError:
1204:0:(client.c:792:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@ffff81017b295c00 x1315300608347977/t0 o101->lu
Oct 4 23:56:13 web5 kernel: LustreError:
1204:0:(mdc_locks.c:601:mdc_enqueue()) ldlm_cli_enqueue: -108
Oct 4 23:56:13 web5 php-cgi: PHP Warning: session_start() [<a
href='function.session-start'>function.session-start</a>]:
open(/mnt/lustre/megashare/session
Oct 4 23:56:13 web5 php-cgi: PHP Warning: session_start() [<a
href='function.session-start'>function.session-start</a>]:
open(/mnt/lustre/megashare/session
Oct 4 23:56:13 web5 php-cgi: PHP Warning: session_start() [<a
href='function.session-start'>function.session-start</a>]:
open(/mnt/lustre/megashare/session
Oct 4 23:56:13 web5 kernel: LustreError:
1193:0:(client.c:792:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@ffff81016f855000 x1315300608347982/t0 o101->lu
Oct 4 23:56:13 web5 kernel: LustreError:
1193:0:(client.c:792:ptlrpc_import_delay_req()) Skipped 4 previous similar
messages
Oct 4 23:56:13 web5 php-cgi: PHP Warning: Unknown:
open(/mnt/lustre/megashare/session/sess_7s287b1tjipfoh1ud1basgu773, O_RDWR)
failed: Cannot send after tr
Oct 4 23:56:13 web5 php-cgi: PHP Warning: Unknown: Failed to write session
data (files). Please verify that the current setting of session.save_path is
cor
Oct 4 23:56:13 web5 kernel: LustreError:
1193:0:(mdc_locks.c:601:mdc_enqueue()) ldlm_cli_enqueue: -108
Oct 4 23:56:13 web5 kernel: LustreError:
1193:0:(mdc_locks.c:601:mdc_enqueue()) Skipped 4 previous similar messages
Oct 4 23:56:13 web5 php-cgi: PHP Warning:
fopen(/mnt/lustre/megashare/tmp/query-qqjrdr3m16ni2s21kurbilbeu2) [<a
href='function.fopen'>function.fopen</a>]:
Oct 4 23:56:13 web5 php-cgi: PHP Warning:
fopen(/mnt/lustre/megashare/tmp/query-qqjrdr3m16ni2s21kurbilbeu2) [<a
href='function.fopen'>function.fopen</a>]:
Oct 4 23:56:13 web5 kernel: LustreError:
1605:0:(dir.c:239:ll_get_dir_page()) lock enqueue: rc: -108
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fwrite(): supplied argument is
not a valid stream resource in /srv/www/lighttpd/search.php on line 66
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fwrite(): supplied argument is
not a valid stream resource in /srv/www/lighttpd/search.php on line 66
Oct 4 23:56:13 web5 kernel: LustreError: 1605:0:(dir.c:384:ll_readdir_18())
error reading dir 28443012/2724301973 page 171: rc -108
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fclose(): supplied argument is
not a valid stream resource in /srv/www/lighttpd/search.php on line 67
Oct 4 23:56:13 web5 kernel: LustreError:
1605:0:(dir.c:239:ll_get_dir_page()) lock enqueue: rc: -108
Oct 4 23:56:13 web5 kernel: LustreError:
1605:0:(dir.c:239:ll_get_dir_page()) Skipped 10 previous similar messages
Oct 4 23:56:13 web5 kernel: LustreError: 1605:0:(dir.c:384:ll_readdir_18())
error reading dir 28443012/2724301973 page 182: rc -108
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fclose(): supplied argument is
not a valid stream resource in /srv/www/lighttpd/search.php on line 67
Oct 4 23:56:13 web5 kernel: LustreError: 1605:0:(dir.c:384:ll_readdir_18())
Skipped 10 previous similar messages
Oct 4 23:56:13 web5 kernel: LustreError:
1714:0:(file.c:114:ll_close_inode_openhandle()) inode 28511437 mdc close
failed: rc = -108
Oct 4 23:56:13 web5 kernel: Lustre: lustre-MDT0000-mdc-ffff81022e8e9400:
Connection restored to service lustre-MDT0000 using nid 192.168.1.78@tcp.
Oct 4 23:56:14 web5 php-cgi: PHP Warning:
unlink(/mnt/lustre/megashare/tmp/query-qqjrdr3m16ni2s21kurbilbeu2) [<a
href='function.unlink'>function.unlink</a>
Oct 4 23:56:14 web5 php-cgi: PHP Warning:
unlink(/mnt/lustre/megashare/tmp/query-qqjrdr3m16ni2s21kurbilbeu2) [<a
href='function.unlink'>function.unlink</a>
Oct 4 23:56:14 web5 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.1.78@tcp. The mds_close operation failed with
-116
Oct 4 23:56:14 web5 kernel: LustreError:
1605:0:(file.c:114:ll_close_inode_openhandle()) inode 28443012 mdc close
failed: rc = -116
Oct 4 23:56:14 web5 kernel: LustreError:
1605:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock())
ASSERTION(lock->l_writers > 0) failed
Oct 4 23:56:14 web5 kernel: LustreError:
1605:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock()) LBUG
Oct 4 23:56:14 web5 kernel: Lustre:
1605:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for
process 1605
Oct 4 23:56:14 web5 kernel: php-cgi R running task 0 1605
10752 1704 1592 (NOTLB)
Oct 4 23:56:14 web5 kernel: ffff810169e3fba4 0000000188496a21
ffff810169e3fba0 ffff810176aaac80
Oct 4 23:56:14 web5 kernel: ffff81022e0f7e00 0000000000000000
0000002000000000 0000000100000001
Oct 4 23:56:14 web5 kernel: ffff8101677d4060 000000028848dd29
0000000000000000 ffff81012da80cc0
Oct 4 23:56:14 web5 kernel: Call Trace:
Oct 4 23:56:14 web5 kernel: [<ffffffff884b802f>]
:ptlrpc:ldlm_flock_completion_ast+0x83f/0x920
Oct 4 23:56:14 web5 kernel: [<ffffffff884913e9>]
:ptlrpc:ldlm_lock_enqueue+0x879/0x9c0
Oct 4 23:56:14 web5 kernel: [<ffffffff884a9960>]
:ptlrpc:ldlm_cli_enqueue_fini+0xa40/0xbd0
Oct 4 23:56:14 web5 kernel: [<ffffffff884a8c93>]
:ptlrpc:ldlm_prep_elc_req+0x2f3/0x420
Oct 4 23:56:14 web5 kernel: [<ffffffff80087d97>]
default_wake_function+0x0/0xe
Oct 4 23:56:14 web5 kernel: [<ffffffff884ab791>]
:ptlrpc:ldlm_cli_enqueue+0x681/0x740
Oct 4 23:56:14 web5 kernel: [<ffffffff886727cf>]
:lustre:ll_file_flock+0x57f/0x670
Oct 4 23:56:14 web5 kernel: [<ffffffff884b77f0>]
:ptlrpc:ldlm_flock_completion_ast+0x0/0x920
Oct 4 23:56:14 web5 kernel: [<ffffffff80030506>]
locks_remove_posix+0x84/0xa8
Oct 4 23:56:14 web5 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Oct 4 23:56:14 web5 kernel: [<ffffffff80023a32>] filp_close+0x54/0x64
Oct 4 23:56:14 web5 kernel: [<ffffffff8001da4b>] sys_close+0x88/0xa2
Oct 4 23:56:14 web5 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Oct 4 23:56:14 web5 kernel:
Oct 4 23:56:14 web5 kernel: LustreError: dumping log to
/tmp/lustre-log.1254675374.1605
Oct 5 00:02:35 web5 kernel: BUG: soft lockup - CPU#1 stuck for 10s!
[php-cgi:1782]
Oct 5 00:02:35 web5 kernel: CPU 1:
Oct 5 00:02:35 web5 kernel: Modules linked in: nfs(U) lockd(U) fscache(U)
nfs_acl(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U)
ptlrpc(U) o
Oct 5 00:02:35 web5 kernel: Pid: 1782, comm: php-cgi Tainted: G
2.6.18-92.1.17.el5_lustre.1.8.0custom #2
....................
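For context: the PHP warnings above come from php-cgi keeping its session
and temporary files on the Lustre mount, so every session_start() and
fopen() touches Lustre. Inferred from the paths in the log (not copied from
our actual config), the relevant php.ini settings look roughly like:

  ; sketch, inferred from the warning paths above
  session.save_handler = files
  session.save_path = /mnt/lustre/megashare/session

This is why the eviction shows up as PHP session and fopen warnings.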
And these are the logs on the MDS (the MGS and MDT are on the same node):
Oct 4 23:55:29 MDS1 kernel: LustreError:
0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer
expired after 101s: evicting client at 192.168.1.75@tcp ns:
mds-lustre-MDT0000_UUID lock: ffff81010de8dc00/0x2d5a67974b5b0f96 lrc: 3/0,0
mode: CR/CR res: 28444597/2724645784 bits 0x3 rrc: 2 type: IBT flags:
0x4000020 remote: 0x9b8763ff47462714 expref: 869 pid: 19255 timeout:
4921443228
Oct 4 23:55:29 MDS1 kernel: Lustre:
19259:0:(mds_reint.c:237:mds_finish_transno()) commit transaction for
disconnected client 039bd606-7bf9-7b26-2866-c58d394d5441: rc 0
Oct 4 23:55:29 MDS1 kernel: LustreError:
19304:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino
28443012: cookie 0x2d5a67974b5b0b60 req@ffff81011c5c9c00 x1315300608349333/t0
o35->039bd606-7bf9-7b26-2866-c58d394d5441@NET_0x20000c0a8014b_UUID:0/0 lens
408/696 e 0 to 0 dl 1254675335 ref 1 fl Interpret:/0/0 rc 0/0
Oct 4 23:55:29 MDS1 kernel: LustreError:
19304:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error
(-116) req@ffff81011c5c9c00 x1315300608349333/t0
o35->039bd606-7bf9-7b26-2866-c58d394d5441@NET_0x20000c0a8014b_UUID:0/0 lens
408/696 e 0 to 0 dl 1254675335 ref 1 fl Interpret:/0/0 rc -116/0
Oct 5 00:51:08 MDS1 kernel: LustreError:
21766:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096),
evicting 039bd606-7bf9-7b26-2866-c58d394d5441@NET_0x20000c0a8014b_UUID
Oct 5 00:51:08 MDS1 kernel: req@ffff81018360ac00 x1315300608351958/t0
o37->039bd606-7bf9-7b26-2866-c58d394d5441@NET_0x20000c0a8014b_UUID:0/0 lens
408/360 e 1 to 0 dl 1254678693 ref 1 fl Interpret:/0/0 rc 0/0
Oct 5 00:51:08 MDS1 kernel: Lustre:
21766:0:(ldlm_lockd.c:808:ldlm_server_completion_ast()) ### enqueue wait
took 3339s from 1254675329 ns: mds-lustre-MDT0000_UUID lock:
ffff81017a6b6600/0x2d5a67974b5b63c7 lrc: 3/0,0 mode: PW/PW res:
28456302/2724672227 rrc: 3 type: FLK pid: 1206 [0->9223372036854775807]
flags: 0x0 remote: 0x9b8763ff474660b1 expref: 185 pid: 19262 timeout: 0
Oct 5 00:51:31 MDS1 kernel: LustreError:
19268:0:(handler.c:1554:mds_handle()) operation 400 on unconnected MDS from
12345-192.168.1.75@tcp
Oct 5 00:51:31 MDS1 kernel: LustreError:
19268:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error
(-107) req@ffff810136d09800 x1315300608351964/t0
o400-><?>@<?>:0/0 lens
192/0 e 0 to 0 dl 1254678697 ref 1 fl Interpret:H/0/0 rc -107/0
Oct 5 00:51:53 MDS1 kernel: Lustre:
3408:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for
peer 192.168.1.75@tcp detected: 1; the oldest (ffff810116084ac0) timed out 0
secs ago
Oct 5 00:51:53 MDS1 kernel: Lustre: Request x1314965658587167 sent from
lustre-MDT0000 to NID 192.168.1.75@tcp 45s ago has timed out (limit 45s).
Oct 5 00:51:53 MDS1 kernel: LustreError: 138-a: lustre-MDT0000: A client on
nid 192.168.1.75@tcp was evicted due to a lock completion callback to
192.168.1.75@tcp timed out: rc -107
Oct 5 00:52:05 MDS1 kernel: Lustre:
3408:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for
peer 192.168.1.75@tcp detected: 1; the oldest (ffff810116084ac0) timed out
12 secs ago
Oct 5 00:52:18 MDS1 kernel: Lustre:
3408:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for
peer 192.168.1.75@tcp detected: 1; the oldest (ffff810116084ac0) timed out
25 secs ago
Oct 5 00:52:30 MDS1 kernel: Lustre:
3408:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for
peer 192.168.1.75@tcp detected: 1; the oldest (ffff810116084ac0) timed out
37 secs ago
Oct 5 00:52:32 MDS1 kernel: Lustre:
21766:0:(service.c:1319:ptlrpc_server_handle_request()) @@@ Request
x1315300608351958 took longer than estimated (31+59s); client may timeout.
req@ffff81018360ac00 x1315300608351958/t0
o37->039bd606-7bf9-7b26-2866-c58d394d5441@NET_0x20000c0a8014b_UUID:0/0 lens
408/360 e 1 to 0 dl 1254678693 ref 1 fl Complete:/0/0 rc -110/-110
Oct 5 00:55:20 MDS1 kernel: Lustre: MGS: haven't heard from client
91e0da15-85fa-166e-4633-d907fdc1fdc6 (at 192.168.1.75@tcp) in 229 seconds. I
think it's dead, and I am evicting it.
Oct 5 00:55:23 MDS1 kernel: Lustre: lustre-MDT0000: haven't heard from
client 039bd606-7bf9-7b26-2866-c58d394d5441 (at 192.168.1.75@tcp) in 232
seconds. I think it's dead, and I am evicting it.
Oct 5 00:55:23 MDS1 kernel: LustreError:
19238:0:(lprocfs_status.c:1644:lprocfs_exp_cleanup()) obd_export's client
uuid 039bd606-7bf9-7b26-2866-c58d394d5441 are not found in its nid_stats list
It looks like the TCP connection between the MDS and the client is dropped,
but when I check with lctl ping it is OK. The only exception is that the MDS
(and the other nodes) cannot ssh to this client; the checks I run are shown
below. My provisional measure is to reboot the client, which makes it work
again, but after some time the problem happens again (and it can happen with
other clients too). I don't think the cause is the network, because my
network is stable.
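For reference, these are roughly the checks I run (the NIDs are taken from
the logs above):

  # on the client (192.168.1.75), ping the MDS NID over LNET
  lctl ping 192.168.1.78@tcp
  # on the MDS, ping the client NID; both directions succeed
  lctl ping 192.168.1.75@tcp
  # plain ssh from the MDS to the client; this is the part that fails
  ssh 192.168.1.75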
What is happening with my system, and how can I fix it? I know the problem
of evicted clients has been discussed a lot, but I have not found a way to
solve my problem.
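The client also dumped a Lustre debug log (/tmp/lustre-log.1254675374.1605,
see above). If it is useful I can convert it to text and post it, e.g. with
the standard lctl usage (assuming defaults):

  lctl debug_file /tmp/lustre-log.1254675374.1605 /tmp/lustre-log.txt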
thanks,