Michael Schwartzkopff
2009-Oct-14 08:08 UTC
[Lustre-discuss] Problem re-mounting Lustre on another node
Hi,

we have a Lustre 1.8 cluster with OpenAIS and Pacemaker as the cluster manager. When I migrate one Lustre resource from one node to another node, I get an error. Stopping Lustre on one node is no problem, but the node where Lustre should start says:

Oct 14 09:54:28 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal
Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete.
Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent)
Oct 14 09:54:39 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled
Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled
Oct 14 09:54:39 sososd6 kernel: Lustre: MGC134.171.16.190@tcp: Reactivating import
Oct 14 09:54:45 sososd6 kernel: LustreError: 137-5: UUID 'segfs-OST0000_UUID' is not available for connect (no target)
Oct 14 09:54:45 sososd6 kernel: LustreError: Skipped 3 previous similar messages
Oct 14 09:54:45 sososd6 kernel: LustreError: 31334:0:(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error (-19) req@ffff810225fcb800 x334514011/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1255506985 ref 1 fl Interpret:/0/0 rc -19/0
Oct 14 09:54:45 sososd6 kernel: LustreError: 31334:0:(ldlm_lib.c:1850:target_send_reply_msg()) Skipped 3 previous similar messages

These logs continue until the cluster software times out and the resource tells me about the error. Any help understanding these logs? Thanks.

-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Adresse: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75
mail: misch@multinet.de
web: www.multinet.de

Sitz der Gesellschaft: 85630 Grasbrunn
Registergericht: Amtsgericht München HRB 114375
Geschäftsführer: Günter Jurgeneit, Hubert Martens

---
PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42
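One way to see whether this is a Lustre-side problem or only a cluster-manager timeout is to mount the OST by hand on the target node, outside of Pacemaker, and watch the kernel log. The sketch below assumes the backing device is /dev/dm-4 as in the logs above; the mount point /mnt/ost0000 is only a placeholder, not taken from the original configuration.

  # Manual mount test on the failover node, outside of Pacemaker.
  # /dev/dm-4 comes from the logs above; /mnt/ost0000 is a hypothetical mount point.
  mkdir -p /mnt/ost0000

  # Mount with type "lustre", not "ldiskfs"; recovery of a recently failed-over
  # target can take several minutes, so the command may block for a while.
  mount -t lustre /dev/dm-4 /mnt/ost0000

  # In a second shell, watch the target come up and start recovery.
  dmesg | tail -n 30
  lctl dl                      # lists the local OST device once it is registered
  cat /proc/fs/lustre/devices  # equivalent view through procfs on Lustre 1.8

If the manual mount succeeds but simply takes longer than the cluster manager is willing to wait, the problem is the resource timeout rather than Lustre itself.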
Bernd Schubert
2009-Oct-14 11:47 UTC
[Lustre-discuss] Problem re-mounting Lustre on another node
On Wednesday 14 October 2009, Michael Schwartzkopff wrote:
> Hi,
>
> we have a Lustre 1.8 cluster with OpenAIS and Pacemaker as the cluster
> manager. When I migrate one Lustre resource from one node to another node,
> I get an error. Stopping Lustre on one node is no problem, but the node
> where Lustre should start says:
>
> Oct 14 09:54:28 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
> Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal
> Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete.
> Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent)
> Oct 14 09:54:39 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
> Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled
> Oct 14 09:54:39 sososd6 kernel: Lustre: MGC134.171.16.190@tcp: Reactivating
[...]
>
> These logs continue until the cluster software times out and the resource
> tells me about the error. Any help understanding these logs? Thanks.

What is your start timeout? Do you see mount in the process list? I guess
you just need to increase the timeout; I usually set at least 10 minutes,
sometimes even 20 minutes.

Also see my bug report and, if possible, add further information yourself:

https://bugzilla.lustre.org/show_bug.cgi?id=20402

Thanks,
Bernd

-- 
Bernd Schubert
DataDirect Networks
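For reference, a start timeout along these lines could be set on the OST resource with the crm shell. This is only a sketch, assuming the target is managed by the stock ocf:heartbeat:Filesystem agent; the resource name, device, directory and the 600s value are illustrative (following the "at least 10 minutes" suggestion), not taken from the original configuration.

  # Hypothetical crm shell snippet (Pacemaker 1.0 era); names and paths are examples.
  primitive segfs-ost0000 ocf:heartbeat:Filesystem \
          params device="/dev/dm-4" directory="/mnt/ost0000" fstype="lustre" \
          op monitor interval="120s" timeout="60s" \
          op start interval="0" timeout="600s" \
          op stop interval="0" timeout="600s"

A similarly generous stop timeout is usually worthwhile, since unmounting a busy target can also take well over a minute.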
Bernd Schubert
2009-Oct-14 12:44 UTC
[Lustre-discuss] Problem re-mounting Lustre on another node
On Wednesday 14 October 2009, Michael Schwartzkopff wrote:
> We have timeouts of 60 seconds. But we will move to 300. Thanks for the
> hint.

Check out my bug report; that might not be sufficient.

-- 
Bernd Schubert
DataDirect Networks
Andreas Dilger
2009-Oct-15 00:50 UTC
[Lustre-discuss] Problem re-mounting Lustre on another node
On 14-Oct-09, at 01:08, Michael Schwartzkopff wrote:
> we have a Lustre 1.8 cluster with OpenAIS and Pacemaker as the cluster
> manager. When I migrate one Lustre resource from one node to another
> node, I get an error. Stopping Lustre on one node is no problem, but the
> node where Lustre should start says:
>
> Oct 14 09:54:28 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
> Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal
> Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete.
> Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent)
> Oct 14 09:54:39 sososd6 kernel: kjournald starting.  Commit interval 5 seconds
> Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled
> Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled
> Oct 14 09:54:39 sososd6 kernel: Lustre: MGC134.171.16.190@tcp: Reactivating import
> Oct 14 09:54:45 sososd6 kernel: LustreError: 137-5: UUID 'segfs-OST0000_UUID'
> is not available for connect (no target)

This is likely driven by some client trying to connect to OST0000, but I
don't see anything in the above logs that indicates OST0000 has actually
started up yet. It should have something like:

RECOVERY: service myth-OST0000, 3 recoverable clients, last_rcvd 17180097556
Lustre: OST myth-OST0000 now serving dev (myth-OST0000/81a23803-0711-a534-441a-f5ee34e094a8), but will be in recovery for at least 5:00, or until 3 clients reconnect.
Lustre: Server myth-OST0000 on device /dev/mapper/vgmyth-lvmythost0 has started

> These logs continue until the cluster software times out and the
> resource tells me about the error. Any help understanding these logs? Thanks.

Are you sure you are mounting the OSTs with type "lustre" instead of
"ldiskfs"? I see the above Lustre messages on my system a few seconds
after the LDISKFS messages are printed.

If you are using MMP (which you should be, in an automated failover
configuration) it will add 10-20s of delay to the ldiskfs mount.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
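Both points can be checked quickly on the OSS. A sketch, assuming the target sits on /dev/dm-4 as in the logs above and that the resource is an ocf:heartbeat:Filesystem primitive:

  # Hypothetical checks; /dev/dm-4 is taken from the logs above.

  # 1. How is the target currently mounted? The type must be "lustre", not "ldiskfs".
  mount | grep dm-4

  # 2. What fstype does the cluster resource pass to mount? (crm shell)
  crm configure show | grep -B1 -A3 Filesystem

  # 3. Is MMP enabled on the backing ldiskfs filesystem? (may require the Lustre-patched e2fsprogs)
  dumpe2fs -h /dev/dm-4 2>/dev/null | grep -i -e 'features' -e 'mmp'

If "mmp" shows up in the feature list, the extra 10-20s of MMP delay on every mount is another reason to keep the resource start timeout well above the defaults.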