Hello,
we do see those messages (see below) on our OSTs when under heavy _read_ load
(or when 60+ Jobs are trying to read data at approx the same time).
The OSTs freezes and even console output is down to a few bytes the minute.
After some time the OSTs do revocer.
How to interpret those messages to get a clue where to look further ?
1) Is the underlying Raid to slow to handle the amount of requests ?
1a) Is the MDS to slow to follow ?
2) Is it related to network issues (network too slow, too busy ??) ?
3) Would this be cured by upgrading to 1.8.x ?
Status -5 looks to us like I/O errors.
~ # grep ''5'' /usr/include/*/*errno*
/usr/include/asm-generic/errno-base.h:#define EIO 5 /* I/O
error */
But together with the 3ware support we are pretty sure to have replaced all
snipish disks and data transfer looks ok when not used by lustre (i.e. verifying
up to 30-90MB/s/disk throughput).
OSTs with 2GB RAM on 16port controller, 4GB RAM on 24port controller
Raid6 (write once, read often archive system)
lustre-1.6.6
vanilla-kernel 2.6.22.19
3ware 9650se (16 and 24port) latest 9.5.3 Version
Seagate 31000340NS disks, HITACHI 1TB disks
Thanks and Regards
Heiko
Dec 4 12:42:13 sadosrd24 LustreError:
4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc
ffff810070c9e000
Dec 4 12:42:13 sadosrd24 LustreError:
4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc
ffff810076b78000
Dec 4 12:42:13 sadosrd24 LustreError:
4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc
ffff810071e88000
Dec 4 12:42:13 sadosrd24 LustreError:
4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc
ffff810070c80000
Dec 4 12:42:13 sadosrd24 LustreError:
4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc
ffff810076a80000
Dec 4 12:42:56 sadosrd24 LustreError: 4744:0:(ost_handler.c:882:ost_brw_read())
@@@ timeout on bulk PUT after 100+0s req at ffff81007efa7e00 x7869690/t0
o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 at NET_0x20000c0a8106f_UUID:0/0 lens
384/336 e 0 to 0 dl 1259926976 ref 1 fl Interpret:/0/0 rc 0/0
Dec 4 12:42:56 sadosrd24 Lustre: 4744:0:(ost_handler.c:939:ost_brw_read())
scia-OST001d: ignoring bulk IO comm error with
eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 at NET_0x20000c0a8106f_UUID id
12345-192.168.16.111 at tcp - client will retry
Dec 4 12:42:57 sadosrd24 LustreError: 4754:0:(ost_handler.c:882:ost_brw_read())
@@@ timeout on bulk PUT after 100+0s req at ffff81007dedb400 x7869691/t0
o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 at NET_0x20000c0a8106f_UUID:0/0 lens
384/336 e 0 to 0 dl 1259926977 ref 1 fl Interpret:/0/0 rc 0/0
Dec 4 12:42:57 sadosrd24 Lustre: 4754:0:(ost_handler.c:939:ost_brw_read())
scia-OST001d: ignoring bulk IO comm error with
eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 at NET_0x20000c0a8106f_UUID id
12345-192.168.16.111 at tcp - client will retry
Dec 4 12:43:00 sadosrd24 LustreError: 4757:0:(ost_handler.c:882:ost_brw_read())
@@@ timeout on bulk PUT after 100+0s req at ffff81007e8de000 x7869700/t0
o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 at NET_0x20000c0a8106f_UUID:0/0 lens
384/336 e 0 to 0 dl 1259926980 ref 1 fl Interpret:/0/0 rc 0/0
Dec 4 12:43:00 sadosrd24 LustreError: 4757:0:(ost_handler.c:882:ost_brw_read())
Skipped 2 previous similar messages
Dec 4 12:43:00 sadosrd24 Lustre: 4757:0:(ost_handler.c:939:ost_brw_read())
scia-OST001d: ignoring bulk IO comm error with
eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 at NET_0x20000c0a8106f_UUID id
12345-192.168.16.111 at tcp - client will retry
Dec 4 12:43:00 sadosrd24 Lustre: 4757:0:(ost_handler.c:939:ost_brw_read())
Skipped 2 previous similar messages
Dec 4 12:43:16 sadosrd24 LustreError: 4746:0:(ost_handler.c:882:ost_brw_read())
@@@ timeout on bulk PUT after 94+0s req at ffff81006de99850 x7869735/t0
o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 at NET_0x20000c0a8106f_UUID:0/0 lens
384/336 e 0 to 0 dl 1259926996 ref 1 fl Interpret:/0/0 rc 0/0
Dec 4 12:43:16 sadosrd24 LustreError: 4746:0:(ost_handler.c:882:ost_brw_read())
Skipped 2 previous similar messages
Dec 4 12:43:16 sadosrd24 Lustre: 4746:0:(ost_handler.c:939:ost_brw_read())
scia-OST001d: ignoring bulk IO comm error with
eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 at NET_0x20000c0a8106f_UUID id
12345-192.168.16.111 at tcp - client will retry
Dec 4 12:43:16 sadosrd24 Lustre: 4746:0:(ost_handler.c:939:ost_brw_read())
Skipped 2 previous similar messages
Dec 4 12:44:13 sadosrd24 Lustre:
4728:0:(ldlm_lib.c:538:target_handle_reconnect()) scia-OST001d:
eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 reconnecting