Gerd
2009-Mar-12 15:29 UTC
[Lustre-discuss] ksocknal_process_receive() Error -14 / Error -14 on read from ...
Hi,

We have a 1.6.6 installation using InfiniBand attached DDN OST storage and OSS'es connected to the network with 10GE adapters. When running iozone with ~40 1GE attached clients we see the following on the clients:

Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff8100a01c4000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff810050164000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff81031b920000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff81032192a000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff81001b20c000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff810128406000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff81018c6c2000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff810067fce000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff8102a7c62000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 req at ffff81037f08b000 x35161916/t0 o4->test1-OST0008_UUID at 172.23.125.14@tcp:6/4 lens 384/480 e 0 to 100 dl 1236869066 ref 3 fl Rpc:/0/0 rc 0/0
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:66:request_out_callback()) Skipped 11 previous similar messages
Mar 12 14:42:46 com01-06 kernel: Lustre: Request x35161916 sent from test1-OST0008-osc-ffff810324e8e000 to NID 172.23.125.14 at tcp 0s ago has timed out (limit 100s).
Mar 12 14:42:46 com01-06 kernel: Lustre: Skipped 8 previous similar messages
Mar 12 14:42:46 com01-06 kernel: Lustre: test1-OST0008-osc-ffff810324e8e000: Connection to service test1-OST0008 via nid 172.23.125.14 at tcp was lost; in progress operations using this service will wait for recovery to complete.

And this on the OSS:

Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) [ffff81001f6fc000] Error -14 on read from 12345-172.23.98.133 at tcp ip 172.23.98.133:1021
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) Skipped 5 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(socklnd.c:1631:ksocknal_destroy_conn()) Completing partial receive from 12345-172.23.98.133 at tcp, ip 172.23.98.133:1021, with error
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(socklnd.c:1631:ksocknal_destroy_conn()) Skipped 4 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff810049430000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 6699:0:(ost_handler.c:1153:ost_brw_write()) @@@ network error on bulk GET 0(1048576) req at ffff81007792dc50 x35161902/t0 o4->8ec45cac-9f38-63c9-eb19-b4bad0242b73 at NET_0x20000ac176285_UUID:0/0 lens 384/352 e 0 to 0 dl 1236869066 ref 1 fl Interpret:/0/0 rc 0/0
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 6699:0:(ost_handler.c:1153:ost_brw_write()) Skipped 4 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff8100528b2000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6680:0:(ost_handler.c:1284:ost_brw_write()) test1-OST0010: ignoring bulk IO comm error with bfb4f76d-1090-a175-89cd-7f51df10cc68 at NET_0x20000ac17628d_UUID id 12345-172.23.98.141 at tcp - client will retry
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6680:0:(ost_handler.c:1284:ost_brw_write()) Skipped 85 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff8100633fa000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff81007ea56000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff8100690ea000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff810044aa0000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6509:0:(ldlm_lib.c:538:target_handle_reconnect()) test1-OST0008: 8ec45cac-9f38-63c9-eb19-b4bad0242b73 reconnecting
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6509:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 8 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6509:0:(ldlm_lib.c:773:target_handle_connect()) test1-OST0008: refuse reconnection from 8ec45cac-9f38-63c9-eb19-b4bad0242b73 at 172.23.98.133@tcp to 0xffff810023258000; still busy with 12 active RPCs
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6509:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 5 previous similar messages

What could explain this behaviour? What is Error 14?

Gerd.
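With this many near-identical messages it can help to tally them by the function that emitted them. The following is only a rough sketch, not a Lustre tool: the helper name `tally` and both regular expressions are assumptions based on the message layout visible in the logs above, and it assumes that a "Skipped N previous similar messages" line stands for N suppressed messages from the same function.

```python
import re
from collections import Counter

# Assumed layout, taken from the log lines above: "pid:N:(file.c:line:function())".
LOCATION = re.compile(r"\d+:\d+:\((\w+\.c):(\d+):(\w+)\(\)\)")
# Rate-limit notice that follows a burst of repeated messages.
SKIPPED = re.compile(r"Skipped (\d+) previous similar messages?")


def tally(lines):
    """Count messages per emitting function, including skipped ones."""
    counts = Counter()
    for line in lines:
        loc = LOCATION.search(line)
        if loc is None:
            continue  # no source location on this line, ignore it
        function = loc.group(3)
        skipped = SKIPPED.search(line)
        # A "Skipped N" line is counted as N suppressed messages,
        # any other line as a single message.
        counts[function] += int(skipped.group(1)) if skipped else 1
    return counts


# Three lines copied from the OSS log above.
sample = [
    "Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) [ffff81001f6fc000] Error -14 on read from 12345-172.23.98.133 at tcp ip 172.23.98.133:1021",
    "Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) Skipped 5 previous similar messages",
    "Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff810049430000",
]

counts = tally(sample)
for function, count in counts.most_common():
    print(function, count)
```

Run over the full syslog of each OSS, this would show whether the -14 reads from `ksocknal_process_receive()` dominate or are only a small part of the burst.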
Isaac Huang
2009-Mar-12 19:32 UTC
[Lustre-discuss] ksocknal_process_receive() Error -14 / Error -14 on read from ...
On Thu, Mar 12, 2009 at 03:29:40PM +0000, Gerd wrote:
> Hi,
>
> We have a 1.6.6 installation using InfiniBand attached DDN OST storage
> and OSS'es connected to the network with 10GE adapters. When running
> iozone with ~40 1GE attached clients we see the following on the clients:
> ......
> And this on the OSS:
>
> Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
> 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) [ffff81001f6fc000]
> Error -14 on read from 12345-172.23.98.133 at tcp ip 172.23.98.133:1021
> Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
> 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) Skipped 5 previous
> similar messages

Interesting: the socket read function from the TCP stack returned EFAULT, which usually has something to do with userland memory access and permissions.

Have you seen this error on other servers? What was the kernel version of the OSS? When did it happen during the iozone test? Were you able to reproduce it? Do you have any network-related security module (e.g. LSM) running on the server?

Isaac
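The negative numbers in these messages (the -14 from the socket read, and the status -5 on the bulk callbacks) are negated Linux errno values. A minimal sketch for decoding them, assuming it runs on a Linux host whose errno numbering matches the kernel's; the helper name `decode_status` is made up for this example:

```python
import errno
import os


def decode_status(status):
    """Map a kernel-style negative status to (symbolic name, description)."""
    code = abs(status)
    # errno.errorcode maps numeric values to names such as "EFAULT".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's text for that error number.
    return name, os.strerror(code)


for status in (-14, -5):
    name, text = decode_status(status)
    print(f"{status}: {name} ({text})")
```

On Linux, 14 is EFAULT and 5 is EIO, so the -14 on the OSS is the bad-address error discussed above, and the status -5 lines are generic I/O errors.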
Gerdjan Busker
2009-Mar-13 06:58 UTC
[Lustre-discuss] ksocknal_process_receive() Error -14 / Error -14 on read from ...
Isaac Huang wrote:
> On Thu, Mar 12, 2009 at 03:29:40PM +0000, Gerd wrote:
>
>> Hi,
>>
>> We have a 1.6.6 installation using InfiniBand attached DDN OST storage
>> and OSS'es connected to the network with 10GE adapters. When running
>> iozone with ~40 1GE attached clients we see the following on the clients:
>> ......
>> And this on the OSS:
>>
>> Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
>> 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) [ffff81001f6fc000]
>> Error -14 on read from 12345-172.23.98.133 at tcp ip 172.23.98.133:1021
>> Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
>> 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) Skipped 5 previous
>> similar messages
>>
>
> Interesting, socket read function from TCP stack returned an EFAULT,
> which usually has something to do with userland memory access and
> permission stuff.
>
> Have you seen this error on other servers? What was the kernel version of
> the OSS? When did it happen during the iozone test? Were you able to
> reproduce it? Do you have any network related security module (e.g.
> like LSM) running on the server?
>

This is on 2.6.18-92.1.10.el5_lustre.1.6.6. We see this error shortly after starting iozone, on all 8 OSS'es and (I believe) on most clients. I was thinking of kernel limits or something similar, but we have done iozone runs where this doesn't happen.

Gerd.