Hello everybody!
I have a strange problem with my cluster.
Yesterday I saw that node3 of my Lustre cluster (it's the pair of node4 in
the heartbeat+drbd cluster) had frozen, and node4 had not taken over
the OST.
After a reboot it always wrote 'System halted.' on the console, but it
would not actually go down. I disconnected node3, rebooted node4, and
everything worked fine.
Today I tried to get it working as before, with a fresh system running
CentOS 4.4, drbd 0.7.25 and lustre 1.6.4.1. The array drbd1, which is
normally primary on node4, went fine.
node4:
 0: cs:StandAlone st:Primary/Unknown ld:Consistent
    ns:0 nr:0 dw:15404660 dr:88550854 al:11773 bm:11773 lo:0 pe:0 ua:0 ap:0
node3:
 0: cs:WFConnection st:Secondary/Unknown ld:Consistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
# drbdadm --dry-run wait_connect ost-3
drbdsetup /dev/drbd0 wait_connect --wfc-timeout=120 --degr-wfc-timeout=120
It said: Aborting.
After running drbdadm connect ost-3 I saw this in the messages log:
Jan 24 09:31:37 node4 kernel: drbd0: drbdsetup [8135]: cstate StandAlone --> Unconnected
Jan 24 09:31:37 node4 kernel: drbd0: drbd0_receiver [8136]: cstate Unconnected --> WFConnection
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFConnection --> WFReportParams
Jan 24 09:31:39 node4 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jan 24 09:31:39 node4 kernel: drbd0: Connection established.
Jan 24 09:31:39 node4 kernel: drbd0: I am(P): 1:00000003:00000003:00000053:00000003:10
Jan 24 09:31:39 node4 kernel: drbd0: Peer(S): 1:00000007:00000003:0000004a:00000004:00
Jan 24 09:31:39 node4 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFReportParams --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: error receiving ReportParams, l: 72!
Jan 24 09:31:39 node4 kernel: drbd0: asender terminated
Jan 24 09:31:39 node4 kernel: drbd0: worker terminated
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate StandAlone --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: Connection lost.
Jan 24 09:31:39 node4 kernel: drbd0: receiver terminated
Why didn't it work? I wanted node4 to be the SyncSource; node3
behaved fine and was listening on the right port with cstate WFConnection.
Then I made a mistake: I disabled heartbeat and rebooted node4. Both
nodes were then Secondary, and they started to sync with node3 as the
SyncSource. Why? What would have been the right command?
So they got synced. After that, I don't know exactly when, node4
started to behave like node3 did yesterday: it wrote 'System halted' and
everything stopped working. I stopped heartbeat, reset the node, mounted
the OST by hand, and now it looks fine, but who knows; now I'm a bit paranoid.
I should also say that node3's kernel was 1.6.0.1 with drbd 0.7.22 (but
0.7.25 userland) until the last reboot above. I don't know whether that
could have caused a problem or not.
Does anybody have an idea what happened, and what I should have done at
any point in this history?
Thank you,
tamas
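
A note on the two counter lines in the node4 log above ('I am(P)' and 'Peer(S)'): one way to look at why the connect was refused is to compare the colon-separated fields from left to right. The sketch below is only illustrative. It assumes the fields are hexadecimal and that the first differing field decides which side looks newer; that ordering is an assumption about DRBD 0.7 and is not confirmed anywhere in this thread. The names parse_counters and first_difference are hypothetical helpers, not part of drbd.

```python
import re

# Matches the "I am(P): ..." / "Peer(S): ..." lines from the kernel log.
GC_LINE = re.compile(r"drbd\d+: (I am|Peer)\((P|S)\):\s*([0-9a-fA-F:]+)")


def parse_counters(line):
    """Return (who, role, fields) where fields are the colon-separated
    values parsed as hexadecimal integers (an assumption, see above)."""
    m = GC_LINE.search(line)
    if m is None:
        raise ValueError("not a counter line: %r" % line)
    who, role, raw = m.groups()
    return who, role, [int(f, 16) for f in raw.split(":")]


def first_difference(local, peer):
    """Return (index, side_with_larger_value) for the first field that
    differs, or None if all compared fields are equal."""
    for i, (a, b) in enumerate(zip(local, peer)):
        if a != b:
            return i, ("local" if a > b else "peer")
    return None


local_line = ("Jan 24 09:31:39 node4 kernel: drbd0: I am(P): "
              "1:00000003:00000003:00000053:00000003:10")
peer_line = ("Jan 24 09:31:39 node4 kernel: drbd0: Peer(S): "
             "1:00000007:00000003:0000004a:00000004:00")

_, _, local_fields = parse_counters(local_line)
_, _, peer_fields = parse_counters(peer_line)
result = first_difference(local_fields, peer_fields)
print(result)
```

For the two lines from the log, the first difference is in the second field (3 on node4 versus 7 on the peer). Under the assumption above, that would be consistent with the 'Current Primary shall become sync TARGET!' message.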
On Thu, 2008-01-24 at 19:26 +0100, Papp Tamás wrote:
> Hello everybody!

Hi.

> Does anybody have an idea what happened, and what I should have done at
> any point in this history?

I'm not sure how many people here use drbd or are expert enough on it to
answer your questions, but perhaps a more drbd-focused audience might
yield more results from your query. Check out
http://www.drbd.org/mailinglist.html, which is linked right off the
main drbd page at http://www.drbd.org/.

b.
Brian J. Murrell wrote:
> I'm not sure how many people here use drbd or are expert enough on it to
> answer your questions, but perhaps a more drbd focused audience might
> yield more results from your query. Check out
> http://www.drbd.org/mailinglist.html which is a link right off of the
> main drbd page at http://www.drbd.org/.

Yes, I know, but there were other errors, and I hoped I might get some
help here.

Since this crash these two nodes don't work correctly at all. I see this
in the log:

Jan 25 15:29:03 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@f2f53c00 x17596/t0 o101->MGS@MGC192.168.2.1@tcp_0:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 25 15:29:03 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous similar messages
Jan 25 15:34:12 node3 kernel: LustreError: 137-5: UUID 'hallmark-OST0004_UUID' is not available for connect (no target)
Jan 25 15:34:12 node3 kernel: LustreError: 4625:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-19) req@c4cc6800 x18/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Jan 25 15:34:36 node3 kernel: Lustre: 32315:0:(ldlm_lib.c:519:target_handle_reconnect()) hallmark-OST0003: 926f3926-700d-ecec-1aef-23e026c62343 reconnecting
Jan 25 15:36:06 node3 kernel: Lustre: hallmark-OST0003: haven't heard from client e03956aa-95bd-4a44-8486-90819afd4033 (at 192.168.0.21@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jan 25 15:39:46 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@c43ab600 x17687/t0 o101->MGS@MGC192.168.2.1@tcp_0:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 25 15:39:46 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous similar messages
Jan 25 15:45:16 node3 kernel: Lustre: hallmark-OST0003: haven't heard from client 09a8da79-1b1f-dcb9-197d-28cc16213dd5 (at 192.168.0.112@tcp) in 227 seconds I think it's dead, and I am evicting it.
Jan 25 15:50:11 node3 kernel: Lustre: hallmark-OST0003: haven't heard from client d7aaa9e0-e4db-0013-2556-b594e735eba4 (at 192.168.0.181@tcp) in 228 seconds I think it's dead, and I am evicting it.
Jan 25 15:50:30 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@c1b2c200 x17716/t0 o101->MGS@MGC192.168.2.1@tcp_0:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 25 15:50:30 node3 kernel: LustreError: 4788:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous similar messages

And this:

Jan 25 15:18:38 node3 kernel: Lustre: 4629:0:(ldlm_lib.c:519:target_handle_reconnect()) hallmark-OST0003: 36d92f10-2f06-8b84-74d9-4324f3a8bf52 reconnecting
Jan 25 15:19:05 node3 kernel: LustreError: 137-5: UUID 'hallmark-OST0004_UUID' is not available for connect (no target)

Any idea?

tamas