Mc Carthy, Fergal
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
Loopback devices shouldn't be causing an issue; I have done that myself for that software combination in testing. I can't think of anything else, and if you are satisfied that there is no way that the client and server can be using the bonding, then I don't know of anything else to suggest right now.

Fergal.

--
Fergal.McCarthy@HP.com

(The contents of this message and any attachments to it are confidential and may be legally privileged. If you have received this message in error you should delete it from your system immediately and advise the sender. To any recipient of this message within HP, unless otherwise stated, you should consider this message and attachments as "HP CONFIDENTIAL".)

-----Original Message-----
From: lustre-discuss-admin@lists.clusterfs.com [mailto:lustre-discuss-admin@lists.clusterfs.com] On Behalf Of Selvi Kadirvel
Sent: 03 February 2006 16:07
To: lustre-discuss@clusterfs.com
Cc: Selvi Kadirvel
Subject: Re: [Lustre-discuss] MDS/OST node kernel panics on file writes by client.

I am using loopback block devices for the MDS and OST devices on the server. Could this be causing an issue?
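For anyone reproducing this kind of loopback-backed test layout, a minimal sketch; the file paths, sizes, and loop device names below are illustrative, not taken from this thread:

    # Create sparse backing files for the MDS and OST (sizes are arbitrary here)
    dd if=/dev/zero of=/tmp/lustre-mds bs=1M count=1 seek=511
    dd if=/dev/zero of=/tmp/lustre-ost bs=1M count=1 seek=2047

    # Attach them to loop devices (if the config points --dev at a regular file,
    # lconf is normally expected to handle the loop setup itself; this step just
    # makes the layering explicit)
    losetup /dev/loop0 /tmp/lustre-mds
    losetup /dev/loop1 /tmp/lustre-ost

    # Sanity-check that plain ext3 on the loop device behaves before blaming Lustre
    mkfs.ext3 -q /dev/loop1
    mount /dev/loop1 /mnt/test && dd if=/dev/zero of=/mnt/test/f bs=1M count=10
    umount /mnt/test && losetup -d /dev/loop1 && losetup -d /dev/loop0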
Peter Kjellström
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
On Friday 03 February 2006 16:12, Selvi Kadirvel wrote:
> I have Lustre 1.4.5 running on a 2.6.9-11rhel kernel, both built from
> source. I started with a 2-node test configuration consisting of one
> client node (Node1) and an MDS & OST running on another (Node2).
>
> After I run lconf on both my nodes, the client gets the Lustre mount
> point - /mnt/lustre. Then I can touch and mkdir any number of files/
> dirs in the Lustre filesystem. But when I try to vi one of these
> files or ls into the directory, Node2 crashes with the OOPS message
> below.

Funny, I had the same problem the first time I tried Lustre. I got the MDS and OST up, mounted it from the client, and could do metadata operations, but the OST panicked on the first write... This was on 1.4.5.1 with a 2.6.9-22rhel kernel.

For me the fix was one of two changes: 1) I removed the LVM stripe, or 2) I ran lconf --reformat on the OST and MDS followed by a full restart (start OST, start MDS, mount the client). Since I did both 1) and 2) at the same time I can't really say which was the actual problem, but I will add another OST with an LVM stripe soon-ish, so I guess I'll find out :-)

/Peter

--
------------------------------------------------------------
Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
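A rough sketch of that reformat-and-restart sequence; the config file name and node names are placeholders, and the option spellings should be checked against lconf --help on the installed 1.4 build:

    # On the server (MDS+OST node): tear everything down, then reformat and restart.
    lconf --cleanup --node node2 config.xml
    lconf --reformat --node node2 config.xml     # wipes and re-creates the MDS/OST backing filesystems

    # On the client: clean up any stale state, then bring the mount back.
    lconf --cleanup --node node1 config.xml
    lconf --node node1 config.xml                # sets up /mnt/lustre as defined by the mtpt entry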
Mc Carthy, Fergal
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
Client log messages are consistent with a client losing connectivity and entering connection recovery, waiting for the MDS to come back...

Looking at your loaded kernel modules I see that bonding is one of them; are you using a bonded connection for the link between the client and server nodes? I believe that there have been some stability issues in certain circumstances in the past when using bonded connections... The server panic stack trace does show a bulk timeout call, so it may be that there is some sort of connectivity issue... Possibly try an unbonded link between client and server nodes and see if the problem persists?

Fergal.

--
Fergal.McCarthy@HP.com

(The contents of this message and any attachments to it are confidential and may be legally privileged. If you have received this message in error you should delete it from your system immediately and advise the sender. To any recipient of this message within HP, unless otherwise stated, you should consider this message and attachments as "HP CONFIDENTIAL".)

-----Original Message-----
From: lustre-discuss-admin@lists.clusterfs.com [mailto:lustre-discuss-admin@lists.clusterfs.com] On Behalf Of Selvi Kadirvel
Sent: 03 February 2006 15:12
To: lustre-discuss@clusterfs.com
Cc: Selvi Kadirvel
Subject: [Lustre-discuss] MDS/OST node kernel panics on file writes by client.

I have Lustre 1.4.5 running on a 2.6.9-11rhel kernel, both built from source. I started with a 2-node test configuration consisting of one client node (Node1) and an MDS & OST running on another (Node2).

After I run lconf on both my nodes, the client gets the Lustre mount point - /mnt/lustre. Then I can touch and mkdir any number of files/dirs in the Lustre filesystem. But when I try to vi one of these files or ls into the directory, Node2 crashes with the OOPS message below. The netdump log of the client shows Lustre errors (also attached below).

Does anyone have any ideas on what could be happening?
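To check whether the client-to-server traffic is actually riding the bond, something like the following can be run on the client; the server address (10.13.16.46) is taken from the logs in this thread, and the rest assumes standard iproute2/net-tools and the default Lustre TCP port:

    # Which interface does the kernel pick to reach the server's Lustre address?
    ip route get 10.13.16.46

    # If a bond exists, confirm its state and which slaves are active
    cat /proc/net/bonding/bond0

    # While the client is mounted, see which local address carries the socknal
    # connections (Lustre's TCP port is 988 by default, as in the logs)
    netstat -tn | grep :988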
Selvi Kadirvel
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
Fergal,

There is a bonded interface between the client and server, but the Lustre setup uses the non-bonded interface.

On Feb 3, 2006, at 10:28 AM, Mc Carthy, Fergal wrote:

> Looking at your loaded kernel modules I see that bonding is one of them;
> are you using a bonded connection for the link between the client and
> server nodes? I believe that there have been some stability issues in
> certain circumstances in the past when using bonded connections... The
> server panic stack trace does show a bulk timeout call, so it may be that
> there is some sort of connectivity issue... Possibly try an unbonded
> link between client and server nodes and see if the problem persists?
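A related sanity check, assuming standard glibc/iproute2 tools and using the server name and address that appear in the netdump log (hpcio6.local, 10.13.16.46):

    # Confirm that the name Lustre uses for the server resolves to the address
    # of the non-bonded NIC, and that that NIC actually owns the address
    getent hosts hpcio6.local
    ip -4 addr show | grep -B2 '10.13.16.46'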
Selvi Kadirvel
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
I am using loopback block devices for the MDS and OST devices on the server. Could this be causing an issue?

On Feb 3, 2006, at 10:28 AM, Mc Carthy, Fergal wrote:

> Client log messages are consistent with a client losing connectivity and
> entering connection recovery, waiting for the MDS to come back...
Selvi Kadirvel
2006-May-19 07:36 UTC
[Lustre-discuss] MDS/OST node kernel panics on file writes by client.
I have Lustre 1.4.5 running on a 2.6.9-11rhel kernel, both built from source. I started with a 2-node test configuration consisting of one client node (Node1) and an MDS & OST running on another (Node2).

After I run lconf on both my nodes, the client gets the Lustre mount point - /mnt/lustre. Then I can touch and mkdir any number of files/dirs in the Lustre filesystem. But when I try to vi one of these files or ls into the directory, Node2 crashes with the OOPS message below. The netdump log of the client shows Lustre errors (also attached below).

Does anyone have any ideas on what could be happening?

Thanks,
Selvi

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:74
invalid operand: 0000 [1]
CPU 0
Modules linked in: mds(U) lov(U) osc(U) mdc(U) obdfilter(U) fsfilt_ldiskfs(U) ldiskfs(U) ost(U) ptlrpc(U) obdclass(U) lvfs(U) ksocknal(U) portals(U) libcfs(U) nfs(U) lockd(U) netconsole(U) netdump(U) loop(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) dm_mod(U) button(U) battery(U) ac(U) ohci_hcd(U) ehci_hcd(U) tg3(U) e100(U) mii(U) bonding(U) sg(U) qla2300(U) qla2xxx(U) scsi_mod
<ffffffff80137d62>{panic+211}
RSP: 0018:0000010103549a98  EFLAGS: 00010086
RAX: 000000000000006e RBX: ffffffff80359f91 RCX: 0000000000012d97
RDX: 00000000ffffff01 RSI: 0000000000012d97 RDI: ffffffff8041a360
RBP: 0000000000000010 R08: 0000000000000004 R09: 000001010354978c
R10: 00000000000000c3 R11: ffffffff802d822c R12: 0000000000000000
R13: 0000010104df8680 R14: 0000010102c13000 R15: 0000000000000005
FS:  0000002a95564360(0000) GS:ffffffff80521200(0000) knlGS: 00000000f7fafbb0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000f75ae000 CR3: 0000000000101000 CR4: 00000000000006e0
Process ll_ost_29 (pid: 14357, threadinfo 0000010103548000, task 000001010343cb20)
Stack: 0000003000000020 0000010103549b78 0000010103549ab8 00000000000010a1
       0000010104c9f710 ffffffff803602c4 000000000000006c 0000010104df86e0
       0000000000000000 0000000000000000
Call Trace:
 <ffffffffa06ec6f1>{:ost:obd_preprw+1093} <ffffffff80134b5c>{add_wait_queue+185}
 <ffffffffa06f36c9>{:ost:ost_brw_write+8475} <ffffffff80133742>{default_wake_function+0}
 <ffffffffa06ecd66>{:ost:ost_bulk_timeout+0} <ffffffffa06fd261>{:ost:ost_handle+25648}
 <ffffffffa062889b>{:ptlrpc:ptlrpc_server_handle_request+5128} <ffffffffa01f2a35>{:libcfs:lcw_update_time+26}
 <ffffffffa062b112>{:ptlrpc:ptlrpc_main+4027} <ffffffff80133742>{default_wake_function+0}
 <ffffffffa062a14a>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa062a14a>{:ptlrpc:ptlrpc_retry_rqbds+0}
 <ffffffff80111373>{child_rip+8} <ffffffffa062a157>{:ptlrpc:ptlrpc_main+0}
 <ffffffff8011136b>{child_rip+0}

Code: 0f 0b ad b5 35 80 ff ff ff ff 4a 00 31 ff e8 47 bf fe ff 31
RIP <ffffffff80137d62>{panic+211} RSP <0000010103549a98>
------------------------------------------------------------------------

----------------------------------------------------------------------

Lustre: A connection with 10.13.16.46 timed out; the network or that node may be down.
LustreError: 20897:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xa0d102e ip 10.13.16.46:988
LustreError: 20896:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 10.13.17.198/1022 -> 10.13.16.46/988
LustreError: Host 10.13.16.46 was unreachable; the network or that node may be down, or Lustre may be misconfigured.
LustreError: 20896:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 64 (0xa0d11c6 10.13.17.198->0xa0d102e 10.13.17.198)
LustreError: 20896:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 req@000001012757fc00 x56/t0 o400->mds-test_UUID@NID_hpcio6.local_UUID:12 lens 64/64 ref 2 fl Rpc:N/0/0 rc 0/0
LustreError: 21305:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1138832833, 3s ago) req@000001012757fc00 x56/t0 o400->mds-test_UUID@NID_hpcio6.local_UUID:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: Connection to service mds-test via nid 10.13.16.46 was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 21305:0:(import.c:142:ptlrpc_set_import_discon()) MDC_hpc198.local_mds-test_MNT_hpc198: connection lost to mds-test_UUID@NID_hpcio6.local_UUID
LustreError: 21306:0:(lib-move.c:1510:lib_api_put()) Error sending PUT to 0xa0d102e: 19
LustreError: 20896:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 10.13.17.198/1022 -> 10.13.16.46/988
LustreError: Host 10.13.16.46 was unreachable; the network or that node may be down, or Lustre may be misconfigured.
LustreError: 20896:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 240 (0xa0d11c6 10.13.17.198->0xa0d102e 10.13.17.198)
LustreError: 20896:0:(socknal_cb.c:2103:ksocknal_autoconnect()) previously skipped 2 similar messages
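Given that metadata operations succeed and the server trace dies in the bulk write path (:ost:ost_brw_write -> :ost:obd_preprw), the smallest likely trigger from the client is probably a single small data write rather than anything vi-specific; the file names below are placeholders:

    # Metadata-only operations reportedly work fine:
    touch /mnt/lustre/testfile
    mkdir /mnt/lustre/testdir

    # The first data write forces an OST bulk RPC (ost_brw_write/obd_preprw on the
    # server) and, per this report, panics Node2:
    dd if=/dev/zero of=/mnt/lustre/testfile bs=4k count=1 && sync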