Nico.Budewitz
2009-Jan-28 10:14 UTC
[Lustre-discuss] can't find mdc_open_data, but the close succeeded
Hi,

I had strange problems this morning: a parallel application on 256 cores was started and recovered 512 GB from its last checkpoint at 4:03 this morning. The application had to read the last checkpoint (256 x 2 GB) from Lustre into main memory. It seems the filesystem was hanging for a while, and other applications using Lustre at the same time could not open files any more and crashed instantly.

I received these error messages on the master node of the 256-core job:

Jan 28 04:04:07 node005 kernel: Lustre: Request x1138068 sent from aeifs2-MDT0000-mdc-ffff81021db45400 to NID 10.13.32.12@o2ib 8s ago has timed out (limit 8s).
Jan 28 04:04:24 node005 kernel: Lustre: 8957:0:(import.c:410:import_select_connection()) aeifs2-MDT0000-mdc-ffff81021db45400: tried all connections, increasing latency to 6s
Jan 28 04:04:24 node005 kernel: Lustre: Changing connection for aeifs2-MDT0000-mdc-ffff81021db45400 to 10.13.32.11@o2ib/10.13.32.11@o2ib
Jan 28 04:04:24 node005 kernel: LustreError: 167-0: This client was evicted by aeifs2-MDT0000; in progress operations using this service will fail.
Jan 28 04:04:24 node005 kernel: LustreError: 9022:0:(file.c:116:ll_close_inode_openhandle()) inode 4262393 mdc close failed: rc = -5
Jan 28 04:04:24 node005 kernel: LustreError: 10503:0:(client.c:722:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8101f337be00 x1138066/t0 o35->aeifs2-MDT0000_UUID@10.13.32.11@o2ib:23/10 lens 296/1248 e 0 to 6 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Jan 28 04:04:24 node005 kernel: LustreError: 9022:0:(file.c:116:ll_close_inode_openhandle()) Skipped 1 previous similar message
Jan 28 04:04:24 node005 kernel: LustreError: 15629:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
Jan 28 04:04:24 node005 kernel: Lustre: aeifs2-MDT0000-mdc-ffff81021db45400: Connection restored to service aeifs2-MDT0000 using nid 10.13.32.11@o2ib.
Jan 28 04:04:25 node005 kernel: LustreError: 15629:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell <http://bugzilla.lustre.org/>.
Jan 28 04:04:25 node005 kernel: LustreError: 10503:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell <http://bugzilla.lustre.org/>.
Jan 28 04:04:26 node005 kernel: LustreError: 15629:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell <http://bugzilla.lustre.org/>.
Jan 28 04:04:27 node005 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.32.11@o2ib. The mds_close operation failed with -116
Jan 28 04:04:27 node005 kernel: LustreError: 10506:0:(file.c:116:ll_close_inode_openhandle()) inode 7747211 mdc close failed: rc = -116
Jan 28 04:04:27 node005 kernel: LustreError: 10506:0:(file.c:116:ll_close_inode_openhandle()) Skipped 25 previous similar messages

I am using Lustre 1.6.6 with adaptive timeouts and OFED 1.3.1. Do you have an idea what's going wrong? Does this look like an IB network problem, or are the OSTs too slow in this particular case? I have never had problems like this before.

So long!

"Du skal ikke gå over bekken etter vann" ("Don't cross the stream to fetch water")

Nico Budewitz
Cluster Administrator
Max-Planck-Institute for Gravitational Physics / Albert-Einstein-Institute
Am Muehlenberg 1, 14476 Golm
Tel.: +49 (0)331 567 7364
Fax: +49 (0)331 567 7297
http://supercomputers.aei.mpg.de
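A first step to narrow down whether this is an IB/LNET problem or simply slow OSTs is to check, from the affected client, that the MDS NID is reachable at the LNET level, that the local IB port is clean, and what the adaptive-timeout limits currently are. A minimal sketch follows; lctl ping, lctl list_nids, ibstat and perfquery are standard Lustre/OFED tools, the /proc/sys/lustre paths assume Lustre 1.6.x with adaptive timeouts enabled, and the NID is taken from the log above:

  # LNET-level ping of the MDS NID that timed out
  lctl ping 10.13.32.12@o2ib

  # NIDs this client is configured with
  lctl list_nids

  # Local InfiniBand port state and error counters (OFED diagnostics)
  ibstat
  perfquery

  # Current adaptive-timeout limits on the client; at_max bounds how far
  # the 8s limit seen in the log is allowed to grow
  cat /proc/sys/lustre/at_min /proc/sys/lustre/at_max /proc/sys/lustre/at_history

If the LNET ping to 10.13.32.12@o2ib fails or the port counters show link or symbol errors, the eviction is more likely a fabric problem than server-side slowness.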
Arden Wiebe
2009-Jan-28 13:15 UTC
[Lustre-discuss] can't find mdc_open_data, but the close succeeded
Please read the last paragraph before implementing any of the suggestions below.

I would say you have an interconnect problem on your MGS node. Examine the port state for the given NID 10.13.32.12, note any deficiencies and correct them; don't overlook the switching or cabling. Try a network restart on the 10.13.32.12 node while tailing your logs and watching for signs of reconnection. It sometimes takes a while for all clients to reconnect after changing network settings. I produced similar errors when the MTU of an interface was incompatible with lower MTU values elsewhere, or when a switch was configured incorrectly. Optionally you can unload the Lustre modules (with lustre_rmmod, I believe); failing that, reboot or fail over the node and watch the message logs on the MGS, MDT and clients as it comes back up. Continue to change network settings or hardware until communication is restored, and review the switch configuration.

Your logs suggest the MGS node has intermittent network connectivity, leading to incomplete communication between the MDT and the client nodes; client node005 hangs because it never gets to fully reconnect or disconnect as the MDT is telling it to.

The filesystem should be safe across a reboot of the MGS, but before and after the reboot examine all partitions and RAID arrays and make sure they are mounted properly. Issue lctl dl immediately after the reboot to verify the MGS was mounted correctly and rule out basic mounting errors. Mount a client directly against the MGS; if you can't connect, you know it is network related. Also, if 10.13.32.12 is your failover MGS, pull the plug on it, because it is not configured properly either, whether due to a switching, cabling or network configuration problem. If focusing on the MGS does not solve it, carry on down the line to the MDT and look for anything out of the ordinary, such as an unmounted RAID array or a network configuration deficiency. Hard-reboot any client nodes you suspect are hung, and check the switch configuration data.

That is my interpretation. Take it for what it is worth, but understand this: I have very limited experience with Lustre.
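Concretely, the checks described above would look roughly like this. The filesystem name aeifs2 is taken from the log lines, /mnt/lustre is only a placeholder mount point, and treating 10.13.32.11@o2ib as the primary MGS NID is an assumption; lctl dl, lctl ping and the client mount syntax are standard Lustre 1.6 usage:

  # On the MGS/MDS after reboot: verify the Lustre devices are set up
  lctl dl

  # From a client: basic LNET reachability of both server NIDs
  lctl ping 10.13.32.11@o2ib
  lctl ping 10.13.32.12@o2ib

  # Check interface MTUs on each node; mismatched MTUs across the fabric
  # or an incorrectly configured switch produced similar errors for me
  ip link show
  ibv_devinfo | grep -i mtu

  # Mount a client directly against the MGS; if this fails, the problem
  # is network related rather than a filesystem issue
  mount -t lustre 10.13.32.11@o2ib:/aeifs2 /mnt/lustre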