nic@cray.com
2006-Dec-19 09:10 UTC
[Lustre-devel] [Bug 11308] Linux compute nodes sending illegal matchbits?
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11308

I might have been a wee bit impatient -- with the default obd_timeout of 100s, one needs to wait a bit longer for the EIO to kick out. So this means so far, so good in testing this. I'm going to do some del_peer'ing during "real" Lustre I/O to make sure things are reconnecting properly as well.
nic@cray.com
2006-Dec-19 16:18 UTC
[Lustre-devel] [Bug 11308] Linux compute nodes sending illegal matchbits?
The patch tested out just fine. One question I have: when a client gets a NAK back from the server, it doesn't seem that it returns EIO up the stack. Instead, the reply seems to get dropped on the floor and we let the ptlrpc-level timeouts hit before it is resent. With longer timeouts (300s), this could make for a long time between us seeing the NAK and actually resending the RPC. What is the expected behavior here?

Example syslog traffic when "lctl --net ptl del_peer" was run on an OST (nid00028) while a dd was running on nid00007:

Dec 19 17:14:32 nid00028 kernel: Lustre: 3737:0:(ptllnd_rx_buf.c:569:kptllnd_rx_parse()) NAK 12345-7@ptl: no connection; peer must reconnect
Dec 19 17:14:32 nid00007 kernel: Lustre: 4015:0:(ptllnd_rx_buf.c:539:kptllnd_rx_parse()) NAK from 12345-28@ptl (ptlid:9-28)
Dec 19 17:14:32 nid00007 kernel: Lustre: 4016:0:(router.c:184:lnet_notify()) Upcall: NID 28@ptl is dead
Dec 19 17:14:32 nid00007 kernel: Lustre: 4:0:(linux-debug.c:96:libcfs_run_upcall()) Invoked portals upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,28@ptl,down,1166570069
Dec 19 17:14:43 nid00028 kernel: Lustre: 4890:0:(ldlm_lib.c:489:target_handle_reconnect()) ost_svc: 93f03b41-ebd5-4daa-8f75-eb981390f46e reconnecting
Dec 19 17:14:43 nid00003 kernel: LustreError: 15069:0:(client.c:955:ptlrpc_expire_one_request()) @@@ timeout (sent at 1166570064, 15s ago) [out 1166570064.782602, in 0.000000] req@000001006db8ac00 x2292/t0 o400->ost_svc_UUID@ost_facet_UUID:28 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
Dec 19 17:14:43 nid00028 kernel: Lustre: 4891:0:(filter.c:2985:filter_set_info_async()) ost_svc: received MDS connection from 3@ptl
Dec 19 17:14:43 nid00003 kernel: LustreError: Connection to service ost_svc via nid 28@ptl was lost; in progress operations using this service will wait for recovery to complete.
Dec 19 17:14:43 nid00028 kernel: Lustre: 4891:0:(filter.c:2985:filter_set_info_async()) previously skipped 3 similar messages
Dec 19 17:14:43 nid00003 kernel: Lustre: OSC_eelc0-0c0s0n3_ost_svc_mds_svc: Connection restored to service ost_svc using nid 28@ptl.
Dec 19 17:14:43 nid00003 kernel: Lustre: 15101:0:(mds_lov.c:530:__mds_lov_syncronize()) MDS mds_svc: ost_svc_UUID now active, resetting orphans
Dec 19 17:14:43 nid00003 kernel: Lustre: 15101:0:(mds_lov.c:530:__mds_lov_syncronize()) previously skipped 2 similar messages
Dec 19 17:14:43 nid00028 kernel: Lustre: 4892:0:(recov_thread.c:580:llog_repl_connect()) llcd 0000010072c03000:00000100711da9c0 not empty
Dec 19 17:14:43 nid00028 kernel: Lustre: 4893:0:(filter.c:2364:filter_destroy_precreated()) ost_svc: deleting orphan objects from 886857 to 886981
Dec 19 17:14:43 nid00028 kernel: Lustre: 4893:0:(filter.c:2364:filter_destroy_precreated()) previously skipped 3 similar messages
Dec 19 17:14:43 nid00028 kernel: Lustre: 4989:0:(llog_cat.c:352:llog_cat_process_cb()) processing log 0x11050006:37fdedf6 at index 57 of catalog 0x11050002
Dec 19 17:14:43 nid00028 kernel: Lustre: 4989:0:(llog_cat.c:352:llog_cat_process_cb()) previously skipped 1 similar messages
Dec 19 17:14:43 nid00028 kernel: Lustre: 4989:0:(filter_log.c:227:filter_recov_log_mds_ost_cb()) fetch generation log, send cookie
Dec 19 17:14:43 nid00028 kernel: Lustre: 4989:0:(llog.c:294:llog_process()) recovery from log: 0x11050004:c5056065 stopped
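To put a number on the delay in the trace above, the following is a minimal Python sketch that measures the gap between the first NAK logged on nid00028 and the first reconnect logged on the same node. The helper name `syslog_time` is hypothetical, the year 2006 is taken from the message date because syslog omits it, and `%b` assumes an English/C locale. Syslog timestamps have one-second resolution, so the result is only approximate.

```python
from datetime import datetime

# Year is an assumption taken from the message header; syslog lines omit it.
ASSUMED_YEAR = 2006


def syslog_time(line):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")


# Two lines copied from the trace, both logged on the OST (nid00028).
nak_line = (
    "Dec 19 17:14:32 nid00028 kernel: Lustre: "
    "3737:0:(ptllnd_rx_buf.c:569:kptllnd_rx_parse()) "
    "NAK 12345-7@ptl: no connection; peer must reconnect"
)
reconnect_line = (
    "Dec 19 17:14:43 nid00028 kernel: Lustre: "
    "4890:0:(ldlm_lib.c:489:target_handle_reconnect()) "
    "ost_svc: 93f03b41-ebd5-4daa-8f75-eb981390f46e reconnecting"
)

# Seconds between the NAK being sent and the first reconnect being handled.
gap = (syslog_time(reconnect_line) - syslog_time(nak_line)).total_seconds()
print(f"NAK to first reconnect: {gap:.0f}s")
```

Both lines come from the same host, so clock skew between nodes does not affect the difference. The same helper could be applied to the other lines in the trace, with the caveat that comparing stamps across nodes assumes synchronized clocks.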
eeb@clusterfs.com
2006-Dec-20 09:23 UTC
[Lustre-devel] [Bug 11308] Linux compute nodes sending illegal matchbits?
Receiving the NAK makes the peer reset its connection state and fail any LNET-level communications currently in progress with the peer. If a client gets a NAK in response to the PUT of an RPC request, most probably it will be after that PUT has completed - i.e. there are no outstanding LNET-level communications to complete with failure. This unfortunately means that no events are delivered to Lustre which could tell it that its RPC is in trouble, so it only gets to know about it when the RPC reply fails to arrive within the timeout.
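The ordering issue described above can be illustrated with a toy model. This is a sketch in Python under stated assumptions, not the ptllnd or ptlrpc code: the `Peer` class, the function names, and the event tuples are all hypothetical and exist only to show why a NAK that arrives after the PUT has completed produces no failure event.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical per-peer state: connection flag plus transfers in progress."""
    connected: bool = True
    in_flight: list = field(default_factory=list)


def start_put(peer, name):
    # The transfer is now an outstanding LNET-level communication.
    peer.in_flight.append(name)


def complete_put(peer, name, events):
    # The PUT finished normally; it is no longer outstanding.
    peer.in_flight.remove(name)
    events.append((name, "completed"))


def receive_nak(peer, events):
    # A NAK resets the connection and fails only what is still in progress.
    peer.connected = False
    for name in peer.in_flight:
        events.append((name, "failed"))
    peer.in_flight.clear()


def how_rpc_learns(events, name):
    # With no failure event, the upper layer can only wait for the reply timeout.
    if (name, "failed") in events:
        return "failure event"
    return "reply timeout"


def scenario(nak_before_put_completes):
    peer, events = Peer(), []
    start_put(peer, "rpc-request")
    if not nak_before_put_completes:
        complete_put(peer, "rpc-request", events)
    receive_nak(peer, events)
    return peer, events, how_rpc_learns(events, "rpc-request")


# Likely case from the message: the PUT completes, then the NAK arrives.
print(scenario(nak_before_put_completes=False)[2])
# Less likely case: the NAK arrives while the PUT is still outstanding.
print(scenario(nak_before_put_completes=True)[2])
```

In the first scenario the only event recorded is the PUT's successful completion, so in this model nothing tells the RPC layer about the NAK and it falls back to the reply timeout. In the second, the outstanding PUT is failed and the error is visible immediately.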