Mc Carthy, Fergal
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
Loopback devices shouldn't be causing an issue; I have done that myself for that software combination in testing. I can't think of anything else, and if you are satisfied that there is no way that the client and server can be using the bonding, then I don't know of anything else to suggest right now.

Fergal.

--
Fergal.McCarthy@HP.com

(The contents of this message and any attachments to it are confidential and may be legally privileged. If you have received this message in error you should delete it from your system immediately and advise the sender. To any recipient of this message within HP, unless otherwise stated, you should consider this message and attachments as "HP CONFIDENTIAL".)

-----Original Message-----
From: lustre-discuss-admin@lists.clusterfs.com [mailto:lustre-discuss-admin@lists.clusterfs.com] On Behalf Of Selvi Kadirvel
Sent: 03 February 2006 16:07
To: lustre-discuss@clusterfs.com
Cc: Selvi Kadirvel
Subject: Re: [Lustre-discuss] MDS/OST node kernel panics on file writes by client.

I am using loopback block devices for the MDS and OST devices on the server. Could this be causing an issue?
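For anyone reproducing this kind of loopback-backed test layout, a minimal sketch; the file paths, sizes, and loop device names below are illustrative, not taken from this thread:

    # Create sparse backing files for the MDS and OST (sizes are arbitrary here)
    dd if=/dev/zero of=/tmp/lustre-mds bs=1M count=1 seek=511
    dd if=/dev/zero of=/tmp/lustre-ost bs=1M count=1 seek=2047

    # Attach them to loop devices (if the config points --dev at a regular file,
    # lconf is normally expected to handle the loop setup itself; this step just
    # makes the layering explicit)
    losetup /dev/loop0 /tmp/lustre-mds
    losetup /dev/loop1 /tmp/lustre-ost

    # Sanity-check that plain ext3 on the loop device behaves before blaming Lustre
    mkfs.ext3 -q /dev/loop1
    mount /dev/loop1 /mnt/test && dd if=/dev/zero of=/mnt/test/f bs=1M count=10
    umount /mnt/test && losetup -d /dev/loop1 && losetup -d /dev/loop0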
Peter Kjellström
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
On Friday 03 February 2006 16:12, Selvi Kadirvel wrote:
> I have Lustre 1.4.5 running on a 2.6.9-11rhel kernel, both built from
> source. I started with a 2-node test configuration consisting of one
> client node (Node1) and an MDS & OST running on another (Node2).
>
> After I run lconf on both my nodes, the client gets the Lustre mount
> point - /mnt/lustre. Then I can touch and mkdir any number of files/
> dirs in the Lustre filesystem. But when I try to vi one of these
> files or ls into the directory, Node2 crashes with the OOPS message
> below.

Funny, I had the same problem the first time I tried Lustre. I got the MDS and OST up, mounted it from the client, and could do metadata operations, but the OST panicked on the first write... This was on 1.4.5.1 with a 2.6.9-22rhel kernel.

For me the fix was one of two changes: 1) I removed the LVM stripe, or 2) I ran lconf --reformat on the OST and MDS followed by a full restart (start OST, start MDS, mount the client). Since I did both 1) and 2) at the same time I can't really say which was the actual problem, but I will add another OST with an LVM stripe soon-ish, so I guess I'll find out :-)

/Peter

--
------------------------------------------------------------
Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
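A rough sketch of that reformat-and-restart sequence; the config file name and node names are placeholders, and the option spellings should be checked against lconf --help on the installed 1.4 build:

    # On the server (MDS+OST node): tear everything down, then reformat and restart.
    lconf --cleanup --node node2 config.xml
    lconf --reformat --node node2 config.xml     # wipes and re-creates the MDS/OST backing filesystems

    # On the client: clean up any stale state, then bring the mount back.
    lconf --cleanup --node node1 config.xml
    lconf --node node1 config.xml                # sets up /mnt/lustre as defined by the mtpt entry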
Mc Carthy, Fergal
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
Client log messages are consistent with a client losing connectivity and entering connection recovery, waiting for the MDS to come back...

Looking at your loaded kernel modules I see that bonding is one of them; are you using a bonded connection for the link between the client and server nodes? I believe that there have been some stability issues in certain circumstances in the past when using bonded connections... The server panic stack trace does show a bulk timeout call, so it may be that there is some sort of connectivity issue... Possibly try an unbonded link between client and server nodes and see if the problem persists?

Fergal.

--
Fergal.McCarthy@HP.com

(The contents of this message and any attachments to it are confidential and may be legally privileged. If you have received this message in error you should delete it from your system immediately and advise the sender. To any recipient of this message within HP, unless otherwise stated, you should consider this message and attachments as "HP CONFIDENTIAL".)

-----Original Message-----
From: lustre-discuss-admin@lists.clusterfs.com [mailto:lustre-discuss-admin@lists.clusterfs.com] On Behalf Of Selvi Kadirvel
Sent: 03 February 2006 15:12
To: lustre-discuss@clusterfs.com
Cc: Selvi Kadirvel
Subject: [Lustre-discuss] MDS/OST node kernel panics on file writes by client.

I have Lustre 1.4.5 running on a 2.6.9-11rhel kernel, both built from source. I started with a 2-node test configuration consisting of one client node (Node1) and an MDS & OST running on another (Node2).

After I run lconf on both my nodes, the client gets the Lustre mount point - /mnt/lustre. Then I can touch and mkdir any number of files/dirs in the Lustre filesystem. But when I try to vi one of these files or ls into the directory, Node2 crashes with the OOPS message below. The netdump log of the client shows Lustre errors (also attached below).

Does anyone have any ideas on what could be happening?
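To check whether the client-to-server traffic is actually riding the bond, something like the following can be run on the client; the server address (10.13.16.46) is taken from the logs in this thread, and the rest assumes standard iproute2/net-tools and the default Lustre TCP port:

    # Which interface does the kernel pick to reach the server's Lustre address?
    ip route get 10.13.16.46

    # If a bond exists, confirm its state and which slaves are active
    cat /proc/net/bonding/bond0

    # While the client is mounted, see which local address carries the socknal
    # connections (Lustre's TCP port is 988 by default, as in the logs)
    netstat -tn | grep :988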
Selvi Kadirvel
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
Fergal,

There is a bonded interface between the client and server, but the Lustre setup uses the non-bonded interface.

On Feb 3, 2006, at 10:28 AM, Mc Carthy, Fergal wrote:

> Looking at your loaded kernel modules I see that bonding is one of them;
> are you using a bonded connection for the link between the client and
> server nodes? I believe that there have been some stability issues in
> certain circumstances in the past when using bonded connections... The
> server panic stack trace does show a bulk timeout call, so it may be that
> there is some sort of connectivity issue... Possibly try an unbonded
> link between client and server nodes and see if the problem persists?
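A related sanity check, assuming standard glibc/iproute2 tools and using the server name and address that appear in the netdump log (hpcio6.local, 10.13.16.46):

    # Confirm that the name Lustre uses for the server resolves to the address
    # of the non-bonded NIC, and that that NIC actually owns the address
    getent hosts hpcio6.local
    ip -4 addr show | grep -B2 '10.13.16.46'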
Selvi Kadirvel
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
I am using loopback block devices for the MDS and OST devices on the server. Could this be causing an issue?

On Feb 3, 2006, at 10:28 AM, Mc Carthy, Fergal wrote:

> Client log messages are consistent with a client losing connectivity and
> entering connection recovery, waiting for the MDS to come back...
Selvi Kadirvel
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
I have Lustre 1.4.5 running on a 2.6.9-11rhel kernel, both built from source. I started with a 2-node test configuration consisting of one client node (Node1) and an MDS & OST running on another (Node2).

After I run lconf on both my nodes, the client gets the Lustre mount point - /mnt/lustre. Then I can touch and mkdir any number of files/dirs in the Lustre filesystem. But when I try to vi one of these files or ls into the directory, Node2 crashes with the OOPS message below. The netdump log of the client shows Lustre errors (also attached below).

Does anyone have any ideas on what could be happening?

Thanks,
Selvi

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:74
invalid operand: 0000 [1]
CPU 0
Modules linked in: mds(U) lov(U) osc(U) mdc(U) obdfilter(U) fsfilt_ldiskfs(U) ldiskfs(U) ost(U) ptlrpc(U) obdclass(U) lvfs(U) ksocknal(U) portals(U) libcfs(U) nfs(U) lockd(U) netconsole(U) netdump(U) loop(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) dm_mod(U) button(U) battery(U) ac(U) ohci_hcd(U) ehci_hcd(U) tg3(U) e100(U) mii(U) bonding(U) sg(U) qla2300(U) qla2xxx(U) scsi_mod
<ffffffff80137d62>{panic+211}
RSP: 0018:0000010103549a98  EFLAGS: 00010086
RAX: 000000000000006e RBX: ffffffff80359f91 RCX: 0000000000012d97
RDX: 00000000ffffff01 RSI: 0000000000012d97 RDI: ffffffff8041a360
RBP: 0000000000000010 R08: 0000000000000004 R09: 000001010354978c
R10: 00000000000000c3 R11: ffffffff802d822c R12: 0000000000000000
R13: 0000010104df8680 R14: 0000010102c13000 R15: 0000000000000005
FS:  0000002a95564360(0000) GS:ffffffff80521200(0000) knlGS: 00000000f7fafbb0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000f75ae000 CR3: 0000000000101000 CR4: 00000000000006e0
Process ll_ost_29 (pid: 14357, threadinfo 0000010103548000, task 000001010343cb20)
Stack: 0000003000000020 0000010103549b78 0000010103549ab8 00000000000010a1
       0000010104c9f710 ffffffff803602c4 000000000000006c 0000010104df86e0
       0000000000000000 0000000000000000
Call Trace:
 <ffffffffa06ec6f1>{:ost:obd_preprw+1093} <ffffffff80134b5c>{add_wait_queue+185}
 <ffffffffa06f36c9>{:ost:ost_brw_write+8475} <ffffffff80133742>{default_wake_function+0}
 <ffffffffa06ecd66>{:ost:ost_bulk_timeout+0} <ffffffffa06fd261>{:ost:ost_handle+25648}
 <ffffffffa062889b>{:ptlrpc:ptlrpc_server_handle_request+5128} <ffffffffa01f2a35>{:libcfs:lcw_update_time+26}
 <ffffffffa062b112>{:ptlrpc:ptlrpc_main+4027} <ffffffff80133742>{default_wake_function+0}
 <ffffffffa062a14a>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa062a14a>{:ptlrpc:ptlrpc_retry_rqbds+0}
 <ffffffff80111373>{child_rip+8} <ffffffffa062a157>{:ptlrpc:ptlrpc_main+0}
 <ffffffff8011136b>{child_rip+0}

Code: 0f 0b ad b5 35 80 ff ff ff ff 4a 00 31 ff e8 47 bf fe ff 31
RIP <ffffffff80137d62>{panic+211} RSP <0000010103549a98>
------------------------------------------------------------------------

----------------------------------------------------------------------

Lustre: A connection with 10.13.16.46 timed out; the network or that node may be down.
LustreError: 20897:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xa0d102e ip 10.13.16.46:988
LustreError: 20896:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 10.13.17.198/1022 -> 10.13.16.46/988
LustreError: Host 10.13.16.46 was unreachable; the network or that node may be down, or Lustre may be misconfigured.
LustreError: 20896:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 64 (0xa0d11c6 10.13.17.198->0xa0d102e 10.13.17.198)
LustreError: 20896:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 req@000001012757fc00 x56/t0 o400->mds-test_UUID@NID_hpcio6.local_UUID:12 lens 64/64 ref 2 fl Rpc:N/0/0 rc 0/0
LustreError: 21305:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1138832833, 3s ago) req@000001012757fc00 x56/t0 o400->mds-test_UUID@NID_hpcio6.local_UUID:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: Connection to service mds-test via nid 10.13.16.46 was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 21305:0:(import.c:142:ptlrpc_set_import_discon()) MDC_hpc198.local_mds-test_MNT_hpc198: connection lost to mds-test_UUID@NID_hpcio6.local_UUID
LustreError: 21306:0:(lib-move.c:1510:lib_api_put()) Error sending PUT to 0xa0d102e: 19
LustreError: 20896:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 10.13.17.198/1022 -> 10.13.16.46/988
LustreError: Host 10.13.16.46 was unreachable; the network or that node may be down, or Lustre may be misconfigured.
LustreError: 20896:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 240 (0xa0d11c6 10.13.17.198->0xa0d102e 10.13.17.198)
LustreError: 20896:0:(socknal_cb.c:2103:ksocknal_autoconnect()) previously skipped 2 similar messages
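Given that metadata operations succeed and the server trace dies in the bulk write path (:ost:ost_brw_write -> :ost:obd_preprw), the smallest likely trigger from the client is probably a single small data write rather than anything vi-specific; the file names below are placeholders:

    # Metadata-only operations reportedly work fine:
    touch /mnt/lustre/testfile
    mkdir /mnt/lustre/testdir

    # The first data write forces an OST bulk RPC (ost_brw_write/obd_preprw on the
    # server) and, per this report, panics Node2:
    dd if=/dev/zero of=/mnt/lustre/testfile bs=4k count=1 && sync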