Michael Schwartzkopff
2009-Oct-14 08:08 UTC
[Lustre-discuss] Problem re-mounting Lustre on another node
Hi,

we have a Lustre 1.8 cluster with OpenAIS and Pacemaker as the cluster manager. When I migrate one Lustre resource from one node to another node, I get an error. Stopping Lustre on one node is no problem, but the node where Lustre should start says:

Oct 14 09:54:28 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal
Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete.
Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent)
Oct 14 09:54:39 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled
Oct 14 09:54:39 sososd6 kernel: Lustre: MGC134.171.16.190@tcp: Reactivating import
Oct 14 09:54:45 sososd6 kernel: LustreError: 137-5: UUID 'segfs-OST0000_UUID' is not available for connect (no target)
Oct 14 09:54:45 sososd6 kernel: LustreError: Skipped 3 previous similar messages
Oct 14 09:54:45 sososd6 kernel: LustreError: 31334:0:(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error (-19) req@ffff810225fcb800 x334514011/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1255506985 ref 1 fl Interpret:/0/0 rc -19/0
Oct 14 09:54:45 sososd6 kernel: LustreError: 31334:0:(ldlm_lib.c:1850:target_send_reply_msg()) Skipped 3 previous similar messages

These logs continue until the cluster software times out and the resource tells me about the error. Any help understanding these logs? Thanks.

-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Adresse: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75
mail: misch@multinet.de
web: www.multinet.de

Sitz der Gesellschaft: 85630 Grasbrunn
Registergericht: Amtsgericht München HRB 114375
Geschäftsführer: Günter Jurgeneit, Hubert Martens

---
PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42
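One way to see whether this is a Lustre-side problem or only a cluster-manager timeout is to mount the OST by hand on the target node, outside of Pacemaker, and watch the kernel log. The sketch below assumes the backing device is /dev/dm-4 as in the logs above; the mount point /mnt/ost0000 is only a placeholder, not taken from the original configuration.

  # Manual mount test on the failover node, outside of Pacemaker.
  # /dev/dm-4 comes from the logs above; /mnt/ost0000 is a hypothetical mount point.
  mkdir -p /mnt/ost0000

  # Mount with type "lustre", not "ldiskfs"; recovery of a recently failed-over
  # target can take several minutes, so the command may block for a while.
  mount -t lustre /dev/dm-4 /mnt/ost0000

  # In a second shell, watch the target come up and start recovery.
  dmesg | tail -n 30
  lctl dl                      # lists the local OST device once it is registered
  cat /proc/fs/lustre/devices  # equivalent view through procfs on Lustre 1.8

If the manual mount succeeds but simply takes longer than the cluster manager is willing to wait, the problem is the resource timeout rather than Lustre itself.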
Bernd Schubert
2009-Oct-14 11:47 UTC
[Lustre-discuss] Problem re-mounting Lustre on another node
On Wednesday 14 October 2009, Michael Schwartzkopff wrote:
> Hi,
>
> we have a Lustre 1.8 cluster with OpenAIS and Pacemaker as the cluster
> manager. When I migrate one Lustre resource from one node to another node,
> I get an error. Stopping Lustre on one node is no problem, but the node
> where Lustre should start says:
>
> Oct 14 09:54:28 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
> Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal
> Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete.
> Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent)
> Oct 14 09:54:39 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
> Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled
> Oct 14 09:54:39 sososd6 kernel: Lustre: MGC134.171.16.190@tcp: Reactivating
[...]
>
> These logs continue until the cluster software times out and the resource
> tells me about the error. Any help understanding these logs? Thanks.

What is your start timeout? Do you see mount in the process list? I guess
you just need to increase the timeout; I usually set at least 10 minutes,
sometimes even 20 minutes.

Also see my bug report and, if possible, add further information yourself:

https://bugzilla.lustre.org/show_bug.cgi?id=20402

Thanks,
Bernd

-- 
Bernd Schubert
DataDirect Networks
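For reference, a start timeout along these lines could be set on the OST resource with the crm shell. This is only a sketch, assuming the target is managed by the stock ocf:heartbeat:Filesystem agent; the resource name, device, directory and the 600s value are illustrative (following the "at least 10 minutes" suggestion), not taken from the original configuration.

  # Hypothetical crm shell snippet (Pacemaker 1.0 era); names and paths are examples.
  primitive segfs-ost0000 ocf:heartbeat:Filesystem \
          params device="/dev/dm-4" directory="/mnt/ost0000" fstype="lustre" \
          op monitor interval="120s" timeout="60s" \
          op start interval="0" timeout="600s" \
          op stop interval="0" timeout="600s"

A similarly generous stop timeout is usually worthwhile, since unmounting a busy target can also take well over a minute.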
Bernd Schubert
2009-Oct-14 12:44 UTC
[Lustre-discuss] Problem re-mounting Lustre on another node
On Wednesday 14 October 2009, Michael Schwartzkopff wrote:
> We have timeouts of 60 seconds. But we will move to 300. Thanks for the
> hint.

Check out my bug report; that might not be sufficient.

-- 
Bernd Schubert
DataDirect Networks
Andreas Dilger
2009-Oct-15 00:50 UTC
[Lustre-discuss] Problem re-mounting Lustre on another node
On 14-Oct-09, at 01:08, Michael Schwartzkopff wrote:
> we have a Lustre 1.8 cluster with OpenAIS and Pacemaker as the cluster
> manager. When I migrate one Lustre resource from one node to another
> node, I get an error. Stopping Lustre on one node is no problem, but the
> node where Lustre should start says:
>
> Oct 14 09:54:28 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
> Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal
> Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete.
> Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent)
> Oct 14 09:54:39 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
> Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled
> Oct 14 09:54:39 sososd6 kernel: Lustre: MGC134.171.16.190@tcp: Reactivating import
> Oct 14 09:54:45 sososd6 kernel: LustreError: 137-5: UUID 'segfs-OST0000_UUID'
> is not available for connect (no target)

This is likely driven by some client trying to connect to OST0000, but I
don't see anything in the above logs that indicates OST0000 has actually
started up yet. It should have something like:

RECOVERY: service myth-OST0000, 3 recoverable clients, last_rcvd 17180097556
Lustre: OST myth-OST0000 now serving dev (myth-OST0000/81a23803-0711-a534-441a-f5ee34e094a8), but will be in recovery for at least 5:00, or until 3 clients reconnect.
Lustre: Server myth-OST0000 on device /dev/mapper/vgmyth-lvmythost0 has started

> These logs continue until the cluster software times out and the
> resource tells me about the error. Any help understanding these logs? Thanks.

Are you sure you are mounting the OSTs with type "lustre" instead of
"ldiskfs"? I see the above Lustre messages on my system a few seconds
after the LDISKFS messages are printed.

If you are using MMP (which you should be, in an automated failover
configuration) it will add 10-20s of delay to the ldiskfs mount.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
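Both points can be checked quickly on the OSS. A sketch, assuming the target sits on /dev/dm-4 as in the logs above and that the resource is an ocf:heartbeat:Filesystem primitive:

  # Hypothetical checks; /dev/dm-4 is taken from the logs above.

  # 1. How is the target currently mounted? The type must be "lustre", not "ldiskfs".
  mount | grep dm-4

  # 2. What fstype does the cluster resource pass to mount? (crm shell)
  crm configure show | grep -B1 -A3 Filesystem

  # 3. Is MMP enabled on the backing ldiskfs filesystem? (may require the Lustre-patched e2fsprogs)
  dumpe2fs -h /dev/dm-4 2>/dev/null | grep -i -e 'features' -e 'mmp'

If "mmp" shows up in the feature list, the extra 10-20s of MMP delay on every mount is another reason to keep the resource start timeout well above the defaults.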