Hello everybody!

I have a strange problem with my cluster. Yesterday I saw that node3 of my lustre cluster (it's the pair of node4 in the heartbeat+drbd cluster) had frozen, and node4 didn't take over the OST. After each reboot it wrote 'System halted.' on the console, although it should not have been going down. I disconnected node3, rebooted node4, and everything worked fine.

Today I tried to get it working as before, with a fresh install of CentOS 4.4, drbd 0.7.25 and lustre 1.6.4.1. The array drbd1, which is normally primary on node4, came up fine.

node4:
 0: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:0 dw:15404660 dr:88550854 al:11773 bm:11773 lo:0 pe:0 ua:0 ap:0

node3:
 0: cs:WFConnection st:Secondary/Unknown ld:Consistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0

# drbdadm --dry-run wait_connect ost-3
drbdsetup /dev/drbd0 wait_connect --wfc-timeout=120 --degr-wfc-timeout=120

It said: Aborting.

# drbdadm connect ost-3

In the messages log I saw:

Jan 24 09:31:37 node4 kernel: drbd0: drbdsetup [8135]: cstate StandAlone --> Unconnected
Jan 24 09:31:37 node4 kernel: drbd0: drbd0_receiver [8136]: cstate Unconnected --> WFConnection
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFConnection --> WFReportParams
Jan 24 09:31:39 node4 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jan 24 09:31:39 node4 kernel: drbd0: Connection established.
Jan 24 09:31:39 node4 kernel: drbd0: I am(P): 1:00000003:00000003:00000053:00000003:10
Jan 24 09:31:39 node4 kernel: drbd0: Peer(S): 1:00000007:00000003:0000004a:00000004:00
Jan 24 09:31:39 node4 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFReportParams --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: error receiving ReportParams, l: 72!
Jan 24 09:31:39 node4 kernel: drbd0: asender terminated
Jan 24 09:31:39 node4 kernel: drbd0: worker terminated
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate StandAlone --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: Connection lost.
Jan 24 09:31:39 node4 kernel: drbd0: receiver terminated

Why didn't it work? I wanted node4 to be the SyncSource; node3 behaved fine and was listening on the right port with cstate WFConnection.

Then I made a mistake: I disabled heartbeat and rebooted node4. Both nodes came up as Secondary and started to sync, with node3 as the SyncSource. Why? What would have been the right command?

So they got synced. After that, I don't know exactly when, node4 started to behave like node3 did yesterday: it wrote 'System halted' and everything stopped working. I stopped heartbeat, reset the machine, mounted the OST by hand, and now it looks fine, but who knows; I'm a bit paranoid now.

I should also mention that node3's kernel was 1.6.0.1 with drbd 0.7.22 (but 0.7.25 userland) until the last reboot above. I don't know whether that could have caused a problem.

Does anybody have an idea what happened, or what I should have done differently at any point in this story?

Thank you,

tamas
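The two handshake lines in the log above ("I am(P)" and "Peer(S)") can be compared field by field. Below is a minimal Python sketch of such a comparison. It rests on assumptions that nothing in this thread confirms: that the colon-separated hex fields are ordered by precedence, that the first field which differs decides which side holds the newer data, and that the trailing two-character flag field is ignored. The function name compare_counters is hypothetical and is not part of drbd.

```python
def compare_counters(mine: str, peer: str) -> int:
    """Return 1 if `mine` looks newer, -1 if `peer` does, 0 if they match.

    Assumption (not confirmed in this thread): fields are compared left to
    right as hexadecimal numbers, the first difference decides, and the
    last field (flags) is skipped.
    """
    my_fields = [int(field, 16) for field in mine.split(":")[:-1]]
    peer_fields = [int(field, 16) for field in peer.split(":")[:-1]]
    for a, b in zip(my_fields, peer_fields):
        if a != b:
            return 1 if a > b else -1
    return 0


# Counter strings copied from the node4 log above.
node4 = "1:00000003:00000003:00000053:00000003:10"
node3 = "1:00000007:00000003:0000004a:00000004:00"

result = compare_counters(node4, node3)
print("node4 newer" if result > 0 else "node3 newer" if result < 0 else "equal")
```

Under these assumptions the second field decides (3 on node4, 7 on node3), so node3 would count as the newer side. That reading would be consistent with the "Current Primary shall become sync TARGET!" message, but it is only a reading of the log and not a confirmed diagnosis.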
On Thu, 2008-01-24 at 19:26 +0100, Papp Tamás wrote:
> Hello everybody!

Hi.

> Does anybody have an idea what happened, or what I should have done
> differently at any point in this story?

I'm not sure how many people here use drbd or are expert enough in it to answer your questions, but a more drbd-focused audience might yield more results for your query. Check out http://www.drbd.org/mailinglist.html, which is linked right off the main drbd page at http://www.drbd.org/.

b.
Brian J. Murrell wrote:
> I'm not sure how many people here use drbd or are expert enough on it to
> answer your questions, but perhaps a more drbd focused audience might
> yield more results from your query. Check out
> http://www.drbd.org/mailinglist.html which is a link right off of the
> main drbd page at http://www.drbd.org/.

Yes, I know, but there were other errors as well, and I hoped I might get some help here.

Since the crash, these two nodes haven't worked correctly at all. I see this in the log:

Jan 25 15:29:03 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@f2f53c00 x17596/t0 o101->MGS@MGC192.168.2.1@tcp_0:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 25 15:29:03 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous similar messages
Jan 25 15:34:12 node3 kernel: LustreError: 137-5: UUID 'hallmark-OST0004_UUID' is not available for connect (no target)
Jan 25 15:34:12 node3 kernel: LustreError: 4625:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-19) req@c4cc6800 x18/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Jan 25 15:34:36 node3 kernel: Lustre: 32315:0:(ldlm_lib.c:519:target_handle_reconnect()) hallmark-OST0003: 926f3926-700d-ecec-1aef-23e026c62343 reconnecting
Jan 25 15:36:06 node3 kernel: Lustre: hallmark-OST0003: haven't heard from client e03956aa-95bd-4a44-8486-90819afd4033 (at 192.168.0.21@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jan 25 15:39:46 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@c43ab600 x17687/t0 o101->MGS@MGC192.168.2.1@tcp_0:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 25 15:39:46 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous similar messages
Jan 25 15:45:16 node3 kernel: Lustre: hallmark-OST0003: haven't heard from client 09a8da79-1b1f-dcb9-197d-28cc16213dd5 (at 192.168.0.112@tcp) in 227 seconds I think it's dead, and I am evicting it.
Jan 25 15:50:11 node3 kernel: Lustre: hallmark-OST0003: haven't heard from client d7aaa9e0-e4db-0013-2556-b594e735eba4 (at 192.168.0.181@tcp) in 228 seconds I think it's dead, and I am evicting it.
Jan 25 15:50:30 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@c1b2c200 x17716/t0 o101->MGS@MGC192.168.2.1@tcp_0:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 25 15:50:30 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous similar messages

And this:

Jan 25 15:18:38 node3 kernel: Lustre: 4629:0:(ldlm_lib.c:519:target_handle_reconnect()) hallmark-OST0003: 36d92f10-2f06-8b84-74d9-4324f3a8bf52 reconnecting
Jan 25 15:19:05 node3 kernel: LustreError: 137-5: UUID 'hallmark-OST0004_UUID' is not available for connect (no target)

Any idea?

tamas