Hello all.
I'm just getting started with Lustre. It was set up by the supplier on
delivery, and I'm now seeing a problem; I'd appreciate advice on a
troubleshooting procedure to find and fix it.
Our setup
---------
As Lustre servers we're running two Dell 2950 boxes with SuSE 10.1;
I'll call them servername1 and servername2. I believe servername1 to be
the primary.
uname -a gives:
Linux servername1 2.6.16-27-0.9_lustre-1.6.0.1smp #1 SMP Tue May 29
20:19:31 BST 2007 x86_64 x86_64 x86_64 GNU/Linux
These are both Fibre Channel-attached to two Xyratex boxes of SATA
disks in a RAID5 configuration, and we have heartbeat set up so that if
one Lustre server goes down the other can serve from both storage
boxes. StorView shows both disk arrays running with no errors. One has
logical volumes:
MDS0, LD0-1, LD0-2, LD0-3, LD4 (this last name looks inconsistent with the others)
the other:
MDS1, LD1-1, LD1-2, LD1-3, LD1-4
Our client machine is another Dell 2950, networked to the two Lustre
servers over gigabit Ethernet. It runs Red Hat Enterprise Linux WS
release 4 (Nahant Update 5).
uname -a gives:
Linux clientname 2.6.9-42.0.10.EL_lustre-1.6.0.1smp #1 SMP Thu May 3
20:37:18 MDT 2007 x86_64 x86_64 x86_64 GNU/Linux
It mounts the Lustre filesystem via /etc/fstab with:
servername1:servername2:/data /disk/00 lustre _netdev 0 0
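For reference, I believe that fstab entry is equivalent to the manual
mount below (syntax as I understand it from the 1.6 manual, with the
second name being the failover MGS node; assuming both NIDs are on
tcp0, and untested here):

```shell
# Mount the Lustre filesystem by hand; servername2 is listed as the
# failover node for the MGS (assumption: both servers use the tcp0 LND).
mount -t lustre servername1@tcp0:servername2@tcp0:/data /disk/00
```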
This has been working fine, if not at high speed, but recently an rsync
process hung. The machine still shows the filesystem mounted but hangs
on access.
dmesg gives:
Lustre: Changing connection for data-OST0007-osc-0000010420740400 to
143.210.36.82@tcp/143.210.36.82@tcp
Lustre: Skipped 47 previous similar messages
LustreError: 3251:0:(client.c:574:ptlrpc_check_status()) @@@ type
=PTL_RPC_MSG_ERR, err == -16 req@0000010420459c00 x117468083/t0
o8->data-OST0007_UUID@143.210.36.82@tcp:28 lens 304/328 ref 1 fl
Rpc:R/0/0 rc 0/-16
LustreError: 3251:0:(client.c:574:ptlrpc_check_status()) Skipped 47
previous similar messages
Lustre: Changing connection for data-OST0007-osc-0000010420740400 to
143.210.36.82@tcp/143.210.36.82@tcp
Lustre: Skipped 47 previous similar messages
LustreError: 3251:0:(client.c:574:ptlrpc_check_status()) @@@ type
=PTL_RPC_MSG_ERR, err == -16 req@0000010037d57800 x117468347/t0
o8->data-OST0007_UUID@143.210.36.82@tcp:28 lens 304/328 ref 1 fl
Rpc:R/0/0 rc 0/-16
LustreError: 3251:0:(client.c:574:ptlrpc_check_status()) Skipped 47
previous similar messages
Lustre: Changing connection for data-OST0007-osc-0000010420740400 to
143.210.36.82@tcp/143.210.36.82@tcp
Lustre: Skipped 47 previous similar messages
LustreError: 3251:0:(client.c:574:ptlrpc_check_status()) @@@ type
=PTL_RPC_MSG_ERR, err == -16 req@00000100cff7ba00 x117468611/t0
o8->data-OST0007_UUID@143.210.36.82@tcp:28 lens 304/328 ref 1 fl
Rpc:R/0/0 rc 0/-16
LustreError: 3251:0:(client.c:574:ptlrpc_check_status()) Skipped 47
previous similar messages
etc.
One Lustre server, servername1, has mounted:
/dev/sdb on /mdt type lustre (rw)
/dev/sdc on /ost0 type lustre (rw)
/dev/sdd on /ost1 type lustre (rw)
/dev/sde on /ost2 type lustre (rw)
/dev/sdf on /ost3 type lustre (rw)
the other, servername2, has mounted:
/dev/sdh on /ost4 type lustre (rw)
/dev/sdi on /ost5 type lustre (rw)
/dev/sdj on /ost6 type lustre (rw)
/dev/sdk on /ost7 type lustre (rw)
dmesg on servername1 gives:
LustreError: 5063:0:(ldlm_lib.c:1363:target_send_reply_msg()) @@@
processing error (-19) req@ffff81006a7d8800 x117469614/t0
o8-><?>@<?>:-1
lens 304/0 ref 0 fl Interpret:/0/0 rc -19/0
LustreError: 5063:0:(ldlm_lib.c:1363:target_send_reply_msg()) Skipped 24
previous similar messages
LustreError: 5416:0:(client.c:517:ptlrpc_import_delay_req()) @@@
IMP_INVALID req@ffff8100bdfa4800 x71606584/t0
o101->MGS@143.210.36.86@tcp:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 5416:0:(client.c:517:ptlrpc_import_delay_req()) Skipped 79
previous similar messages
LustreError: 5078:0:(ldlm_lib.c:576:target_handle_connect()) @@@ UUID
''data-OST0007_UUID'' is not available for connect (no target)
req@ffff81012ac90e00 x117469889/t0 o8-><?>@<?>:-1 lens 304/0 ref
0 fl
Interpret:/0/0 rc 0/0
LustreError: 5078:0:(ldlm_lib.c:576:target_handle_connect()) Skipped 24
previous similar messages
LustreError: 5078:0:(ldlm_lib.c:1363:target_send_reply_msg()) @@@
processing error (-19) req@ffff81012ac90e00 x117469889/t0
o8-><?>@<?>:-1
lens 304/0 ref 0 fl Interpret:/0/0 rc -19/0
LustreError: 5078:0:(ldlm_lib.c:1363:target_send_reply_msg()) Skipped 24
previous similar messages
LustreError: 5416:0:(client.c:517:ptlrpc_import_delay_req()) @@@
IMP_INVALID req@ffff810009b7f000 x71606898/t0
o101->MGS@143.210.36.86@tcp:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 5416:0:(client.c:517:ptlrpc_import_delay_req()) Skipped 79
previous similar messages
etc.
dmesg on servername2 gives:
LustreError: 9045:0:(ldlm_lib.c:1363:target_send_reply_msg()) @@@
processing error (-16) req@ffff81011a47c600 x117469183/t0
o8->a2a1f3f1-d6f0-a6c9-9caa-63d5f8c06478@NET_0x200008fd22469_UUID:-1
lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
LustreError: 9045:0:(ldlm_lib.c:1363:target_send_reply_msg()) Skipped 24
previous similar messages
Lustre: 9018:0:(ldlm_lib.c:497:target_handle_reconnect()) data-OST0007:
a2a1f3f1-d6f0-a6c9-9caa-63d5f8c06478 reconnecting
Lustre: 9018:0:(ldlm_lib.c:497:target_handle_reconnect()) Skipped 24
previous similar messages
Lustre: 9018:0:(ldlm_lib.c:709:target_handle_connect()) data-OST0007:
refuse reconnection from
a2a1f3f1-d6f0-a6c9-9caa-63d5f8c06478@143.210.36.105@tcp to
0xffff8101045e3000/2
Lustre: 9018:0:(ldlm_lib.c:709:target_handle_connect()) Skipped 24
previous similar messages
LustreError: 8969:0:(ldlm_lib.c:1363:target_send_reply_msg()) @@@
processing error (-16) req@ffff810120b74600 x117469458/t0
o8->a2a1f3f1-d6f0-a6c9-9caa-63d5f8c06478@NET_0x200008fd22469_UUID:-1
lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
LustreError: 8969:0:(ldlm_lib.c:1363:target_send_reply_msg()) Skipped 24
previous similar messages
Lustre: 8998:0:(ldlm_lib.c:497:target_handle_reconnect()) data-OST0007:
a2a1f3f1-d6f0-a6c9-9caa-63d5f8c06478 reconnecting
Lustre: 8998:0:(ldlm_lib.c:497:target_handle_reconnect()) Skipped 24
previous similar messages
Lustre: 8998:0:(ldlm_lib.c:709:target_handle_connect()) data-OST0007:
refuse reconnection from
a2a1f3f1-d6f0-a6c9-9caa-63d5f8c06478@143.210.36.105@tcp to
0xffff8101045e3000/2
Lustre: 8998:0:(ldlm_lib.c:709:target_handle_connect()) Skipped 24
previous similar messages
etc.
On our client machine:
lfs > osts
OBDS:
0: data-OST0000_UUID ACTIVE
1: data-OST0001_UUID ACTIVE
2: data-OST0002_UUID ACTIVE
3: data-OST0003_UUID ACTIVE
4: data-OST0004_UUID ACTIVE
5: data-OST0005_UUID ACTIVE
6: data-OST0006_UUID ACTIVE
7: data-OST0007_UUID ACTIVE
/disk/00 has no stripe info
lfs > check servers
data-MDT0000-mdc-0000010420740400 active.
data-OST0000-osc-0000010420740400 active.
data-OST0001-osc-0000010420740400 active.
data-OST0002-osc-0000010420740400 active.
data-OST0003-osc-0000010420740400 active.
data-OST0004-osc-0000010420740400 active.
data-OST0005-osc-0000010420740400 active.
data-OST0006-osc-0000010420740400 active.
and then hangs.
On servername1, a "ps auxw" shows ll_ost_io threads up to
ll_ost_io_396 and ll_ost threads up to ll_ost_128, ending with:
root 1841 0.0 0.0 0 0 ? S Jun22 0:00
[ll_ost_io_396]
servername2 shows:
...
root 694 0.0 0.0 0 0 ? S Jun22 0:00
[ll_ost_io_511]
root 695 0.0 0.0 0 0 ? D Jun22 0:00
[ll_ost_io_512]
so there's a difference, and servername2 has a thread in disk wait
(state D).
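A one-liner like this (just a sketch) can pull the D-state threads out
of such a listing; shown here against the two sample "ps auxw" lines
above rather than live output:

```shell
# The state is field 8 of "ps auxw" output; print the command name of
# any thread stuck in D (uninterruptible disk wait).
printf '%s\n' \
  'root       694  0.0  0.0      0     0 ?  S  Jun22  0:00 [ll_ost_io_511]' \
  'root       695  0.0  0.0      0     0 ?  D  Jun22  0:00 [ll_ost_io_512]' |
awk '$8 ~ /^D/ { print $NF }'
# [ll_ost_io_512]
```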
I'd much appreciate some clues as to how to proceed.
Thank you,
Grant
--
Grant Denkinson <gd41@star.le.ac.uk>
Department of Physics & Astronomy, University of Leicester