Hi all,

The Lustre FS has crashed after the entire system was rebooted. Some of the error messages follow.

On n01-ib0 (one of the clients)
-------------------------------
# /usr/sbin/lconf --node n01-ib0 /etc/lustre/config.xml
MDC: MDC_n01.local_mds_master2_MNT_n01-ib0_2 23503_MNT_n01-ib0_2_a7896cb070 mds_master2_UUID
MDC: MDC_n01.local_mds_master2_MNT_n01-ib0_2 23503_MNT_n01-ib0_2_a7896cb070
! /usr/sbin/lctl (255): IOC_PORTAL_DEL_UUID failed: Invalid argument
# dmesg
ERROR : IPOIB_UD : ipoib_ud_find_dev_by_dst:(ipoib_ud_arp.c) :ip_route_output_key(127.0.0.1) failed
... ...
LustreError: 5208:0:(client.c:947:ptlrpc_expire_one_request()) @@@ timeout (sent at 1200501512, 5s ago) req@0000010117c34600 x1/t0 o38->mds_master2_UUID@s03-ib0_UUID:12 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: Changing connection for MDC_n01.local_mds_master2_MNT_n01-ib0_2 to s04-ib0_UUID/11.0.0.249@vib
LustreError: 5208:0:(client.c:947:ptlrpc_expire_one_request()) @@@ timeout (sent at 1200501517, 5s ago) req@0000010117c34600 x3/t0 o38->mds_master2_UUID@s04-ib0_UUID:12 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: Changing connection for MDC_n01.local_mds_master2_MNT_n01-ib0_2 to s03-ib0_UUID/11.0.0.250@vib
... ...
Lustre: Skipped 39 previous similar messages
ERROR : AD_TAVOR : vvi_mlx_poll_for_completion:(adaptor_tavor.c):VLT: completion_status: 10 (MLX: 12, syndrom: 129), total err num: 5 (not print flush errors)
LustreError: 4941:0:(events.c:53:request_out_callback()) @@@ type 4, status -113 req@000001011294f200 x2794/t0 o38->mds_master2_UUID@s03-ib0_UUID:12 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 4941:0:(events.c:53:request_out_callback()) Skipped 14 previous similar messages
LustreError: 23731:0:(obd_config.c:333:class_cleanup()) OBD MDC_n01.local_mds_master2_MNT_n01-ib0_2 is still busy with 5 references
You should stop active file system users, or use the --force option to cleanup.
LustreError: 23731:0:(obd_config.c:234:class_detach()) OBD device 2 still set up
LustreError: 23732:0:(lustre_peer.c:148:class_del_uuid()) delete non-existent uuid s03-ib0_UUID

On s03-ib0 (failover MDS with s04-ib0)
--------------------------------------
# traceroute 11.0.0.1
traceroute to 11.0.0.1 (11.0.0.1), 30 hops max, 46 byte packets
 1 n01-ib0 (11.0.0.1) 0.149 ms 0.086 ms 0.088 ms
# dmesg
ERROR : IPOIB_UD : ipoib_ud_find_dev_by_dst:(ipoib_ud_arp.c): ip_route_output_key(127.0.0.1) failed
new: ipoib_allow_arp_joins: 1
Linux Kernel Card Services options: [pci] [cardbus] [pm]
ERROR : IPOIB_UD : ipoib_ud_find_dev_by_dst:(ipoib_ud_arp.c): ip_route_output_key(11.0.0.4) failed
... ...
Lustre: Added LNI 11.0.0.250@vib [8/128]
Lustre: 4362:0:(lib-move.c:1644:lnet_parse_put()) Dropping PUT from 12345-11.0.0.3@vib portal 12 match 3734 offset 0 length 240: 2
Lustre: 4362:0:(lib-move.c:1644:lnet_parse_put()) Dropping PUT from 12345-11.0.0.15@vib portal 12 match 3736 offset 0 length 240: 2

Error messages like "ip_route_output_key(*) failed" suggest that the IPoIB interface or routing configuration is probably wrong. However, both the IPoIB interface configuration and the node routing tables seem to be OK.

Any help would be greatly appreciated.

-- 
Regards,
Changer
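
P.S. To make the logs above easier to cross-check against the routing tables, here is a minimal sketch that tallies the destinations named in the "ip_route_output_key(...) failed" messages and the negative status codes in the LustreError lines. The regular expressions are based only on the lines quoted in this post, and the function names (summarize, describe) are hypothetical. It also assumes that the Lustre status field is a negated Linux errno, in which case -113 would be EHOSTUNREACH ("No route to host"); that reading is an assumption, not something the logs confirm.

```python
import re
from collections import Counter

# Pattern for the IPoIB driver message; an assumption derived only from
# the lines quoted in this post, not from the driver source.
ROUTE_FAIL = re.compile(r"ip_route_output_key\((\d+\.\d+\.\d+\.\d+)\)\s+failed")

# Pattern for the "status -NNN" field seen in the LustreError lines.
STATUS = re.compile(r"\bstatus (-\d+)\b")

# Linux errno values (x86) that are plausible for connection problems.
# Treating the Lustre status as a negated errno is an assumption.
ERRNO_NAMES = {
    110: "ETIMEDOUT (Connection timed out)",
    113: "EHOSTUNREACH (No route to host)",
}


def summarize(dmesg_text):
    """Count failing route destinations and status codes in dmesg text."""
    dests = Counter(ROUTE_FAIL.findall(dmesg_text))
    statuses = Counter(int(s) for s in STATUS.findall(dmesg_text))
    return dests, statuses


def describe(status):
    """Map a negative status to an errno name, if it is in the table."""
    return ERRNO_NAMES.get(-status, "unknown")
```

Calling summarize() on the saved dmesg output of each node and comparing the reported destinations with the routing table on that node should show whether the failures cluster on particular peers.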