OK, redundancy works for me sometimes, but not always. I used Heartbeat and failed over the active MDS (10.200.1.251) to the standby MDS (10.200.1.252). I then ran "cat /mnt/lustre/*" on the client (10.200.1.40), and the client hung. As far as I can tell, the OST doesn't even try to contact the standby MDS. Here are the key parts of /var/log/messages on my three systems. Does anyone know what is wrong, and how I can fix it?

-Roger

On MDS: Why is it giving me an error when the client tries to connect?

Aug 11 17:29:33 roger-ha-2 heartbeat: info: mach_down takeover complete for node roger-ha-1.
Aug 11 17:29:33 roger-ha-2 heartbeat[10093]: info: Exiting status process 10150 returned rc 0.
Aug 11 17:29:33 roger-ha-2 heartbeat[10093]: debug: RscMgmtProc 'status' exited code 0
Aug 11 17:29:41 roger-ha-2 kernel: LustreError: 11139:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.200.1.40@tcp portal 12 match 0x3a offset 0 length 240: no match
Aug 11 17:31:50 roger-ha-2 kernel: LustreError: 11139:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.200.1.40@tcp portal 12 match 0x44 offset 0 length 240: no match

On OST: Why doesn't it try to connect to the standby MDS?

Aug 11 17:29:09 blade-lustre2 kernel: Lustre: A connection with 10.200.1.251 timed out; the network or that node may be down.
Aug 11 17:29:09 blade-lustre2 kernel: Lustre: 6767:0:(router.c:184:lnet_notify()) Upcall: NID 10.200.1.251@tcp is dead
Aug 11 17:29:09 blade-lustre2 kernel: Lustre: 4:0:(linux-debug.c:96:libcfs_run_upcall()) Invoked portals upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,10.200.1.251@tcp,down,1155331715
Aug 11 17:32:38 blade-lustre2 kernel: Lustre: ost1-ts: haven't heard from 10.200.1.251@tcp in 243 seconds. Last request was at 1155331715. I think it's dead, and I am evicting it.

(I didn't see any more output for several minutes.)

On OSC: Why can't it find the MDS?

Aug 11 17:29:21 blade-lustre0 kernel: Lustre: A connection with 10.200.1.251 timed out; the network or that node may be down.
Aug 11 17:29:21 blade-lustre0 kernel: Lustre: 5539:0:(router.c:184:lnet_notify()) Upcall: NID 10.200.1.251@tcp is dead
Aug 11 17:29:21 blade-lustre0 kernel: Lustre: 4:0:(linux-debug.c:96:libcfs_run_upcall()) Invoked portals upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,10.200.1.251@tcp,down,1155331702
Aug 11 17:29:39 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-2_UUID
Aug 11 17:31:44 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-1_UUID
Aug 11 17:31:47 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-2_UUID
Aug 11 17:33:52 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-1_UUID
Aug 11 17:35:36 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-1_UUID
Aug 11 17:35:36 blade-lustre0 kernel: Lustre: previously skipped 1 similar messages