thr3ads.net - Lustre discuss - [Lustre-discuss] Understanding of MMP [Oct 2009]

Michael Schwartzkopff

2009-Oct-19 14:46 UTC

[Lustre-discuss] Understanding of MMP

Hi,

Perhaps I am misunderstanding multiple mount protection (MMP). I have a
cluster. When a failover happens, I sometimes get this log entry:

Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: Device is already active on another node.
Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168, last update node: sososd3, last update device: dm-2

Does the second line mean that my node (sososd7) tried to mount /dev/dm-2 but
MMP prevented it from doing so because the last update from the old node
(sososd3) was too recent?

In the manuals I found an MMP time of 109 seconds. Is it correct that after
the umount the next node cannot mount the same filesystem within 10 seconds?

So the solution would be to wait for 10 seconds before mounting the resource
on the next node. Is this correct?

Thanks.

-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Address: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75

mail: misch at multinet.de
web: www.multinet.de

Registered office: 85630 Grasbrunn
Commercial register: Amtsgericht München HRB 114375
Managing directors: Günter Jurgeneit, Hubert Martens

---

PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42

Andreas Dilger

2009-Oct-19 18:14 UTC

On 19-Oct-09, at 08:46, Michael Schwartzkopff wrote:
> perhaps I have a problem understanding multiple mount protection MMP.
> I have a cluster. When a failover happens sometimes I get the log entry:
>
> Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2):
> ldiskfs_multi_mount_protect: Device is already active on another node.
> Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2):
> ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168,
> last update node: sososd3, last update device: dm-2
>
> Does the second line mean that my node (sososd7) tried to mount /dev/dm-2 but
> MMP prevented it from doing so because the last update from the old node
> (sososd3) was too recent?

The update time stored in the MMP block is purely for informational
purposes.  It actually uses a sequence counter that has nothing to do
with the system clock on either of the nodes (since they may not be in
sync).
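
The "last update time" in the log is a Unix epoch value. A minimal sketch for decoding it follows; the helper name is made up for illustration, and as noted above MMP does not base its decision on this field:

```python
from datetime import datetime, timezone


def mmp_update_time_utc(epoch_seconds: int) -> datetime:
    """Decode the informational 'last update time' field as a UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


stamp = mmp_update_time_utc(1255958168)
print(stamp.isoformat())  # 2009-10-19T13:16:08+00:00
```

Assuming the node logs in a UTC+2 time zone, that is 15:16:08 local time, the same second as the syslog lines.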

What that message actually means is that sososd7 tried to mount the
filesystem on dm-2 (which likely has another "LVM" name that the kernel
doesn't know anything about) but the MMP block on the disk was modified
by sososd3 AFTER sososd7 first looked at it.

That is a very bad thing, and is exactly what MMP is designed to detect.
Is sososd3 still running at this point, or has it been STONITH'd?

> From the manuals I found the MMP time of 109 seconds? Is it correct
> that after the umount the next node cannot mount the same
> filesystem within 10 seconds?

You wrote "109" seconds...  Did you really mean "10" seconds?  In any case,
the default MMP timeout is 5s (unless high system load forces this to be
larger), and the mounting node waits 2x this interval to ensure that the
other node has at least one full interval in which to write a new MMP block.
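
The check described here can be sketched roughly as follows. This is a simplified simulation, not the ldiskfs code: `safe_to_mount`, `read_seq` and the stubbed sleep are hypothetical names, and the real implementation handles more cases than a single compare.

```python
import itertools
import time
from typing import Callable


def safe_to_mount(read_seq: Callable[[], int],
                  interval_s: float = 5.0,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if the sequence counter is unchanged after 2x interval."""
    before = read_seq()          # first look at the MMP block
    sleep(2 * interval_s)        # give another node a full interval to update it
    return read_seq() == before  # any change means another node is active


# Simulated devices; sleeping is stubbed out so the demo runs instantly.
no_wait = lambda seconds: None
idle_device = lambda: 42                 # counter never moves
ticking = itertools.count(100)
active_device = lambda: next(ticking)    # counter advances on every read

print(safe_to_mount(idle_device, sleep=no_wait))    # True
print(safe_to_mount(active_device, sleep=no_wait))  # False
```

Because the comparison uses a counter rather than a timestamp, it works even when the clocks of the two nodes are not in sync.
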

> So the solution would be to wait for 10 seconds before mounting the
> resource on the next node. Is this correct?

If the other node is still using the filesystem, then waiting 10s will not
help.  The HA software needs to power off (STONITH) the previous node before
it starts a failover.  Otherwise, there may be any number of blocks still in
cache or in the IO elevator that might land on the disk after the takeover,
if the "failing" node is very slow but not quite dead.
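
The ordering matters more than the delay. A minimal sketch of a takeover that fences first, with hypothetical callbacks standing in for whatever the HA software actually provides:

```python
from typing import Callable, List


def take_over(fence_old_node: Callable[[], bool],
              mount_target: Callable[[], None]) -> bool:
    """Mount the shared target only after the old node is confirmed off."""
    if not fence_old_node():
        return False   # fencing not confirmed: leave the shared disk alone
    mount_target()     # safe now: no stale writes can arrive from the old node
    return True


# Stand-in callbacks that only record the order of operations.
events: List[str] = []


def fake_fence() -> bool:
    events.append("fence")
    return True


def fake_mount() -> None:
    events.append("mount")


print(take_over(fake_fence, fake_mount), events)  # True ['fence', 'mount']
```

If fencing cannot be confirmed, the sketch refuses to mount at all instead of waiting and hoping the old node has gone quiet.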

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Bernd Schubert

2009-Oct-19 18:42 UTC

On Monday 19 October 2009, Andreas Dilger wrote:
> What that message actually means is that sososd7 tried to mount the
> filesystem on dm-2 (which likely has another "LVM" name that the kernel
> doesn't know anything about) but the MMP block on the disk was modified
> by sososd3 AFTER sososd7 first looked at it.

Probably bug#19566. Michael, which Lustre version exactly are you using?


Thanks,
Bernd


-- 
Bernd Schubert
DataDirect Networks

Michael Schwartzkopff

2009-Oct-19 19:08 UTC

On Monday, 19 October 2009, 20:42:19, you wrote:
> Probably, bug#19566. Michael, which Lustre version do you exactly use?

I have version 1.8.1.1, which was released last week. Is the fix included
there, or only in 1.8.2?

Greetings,
-- 
Dr. Michael Schwartzkopff

Bernd Schubert

2009-Oct-19 19:54 UTC

On Monday 19 October 2009, Michael Schwartzkopff wrote:
> I got version 1.8.1.1 which was published last week. Is the fix included or
> only in 1.8.2?

According to Bugzilla (https://bugzilla.lustre.org/show_bug.cgi?id=19566),
the fix is not yet in 1.8.1.1. Our DDN internal releases do of course have it,
and from my point of view this is a really important fix. Since 1.6.7 the
resource agent also no longer has any way to detect an unsuccessful umount
(up to 1.6.6, /proc/fs/lustre/.../mntdev would tell you the device was
still mounted).

To be sure this really is your issue, do you see this in your kernel logs?

                CERROR("Mount %p is still busy (%d refs), giving up.\n",
                       mnt, atomic_read(&mnt->mnt_count));
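
One way to look for that message is to scan saved kernel log lines with a pattern derived from the format string above. This is only a sketch: the function name is made up, and the sample lines are invented for illustration, not real log output.

```python
import re
from typing import Iterable, List, Tuple

# Pattern built from the CERROR format string: "%p" pointer, "%d" ref count.
BUSY_RE = re.compile(r"Mount (\S+) is still busy \((-?\d+) refs\), giving up\.")


def find_busy_mounts(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (mount pointer, reference count) for each matching log line."""
    hits = []
    for line in lines:
        match = BUSY_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


# Invented sample lines; real entries will carry other prefixes and values.
sample = [
    "kernel: some unrelated message",
    "kernel: Mount ffff810123456789 is still busy (2 refs), giving up.",
]
print(find_busy_mounts(sample))  # [('ffff810123456789', 2)]
```

In practice the lines would come from the saved kernel log of the node that tried to unmount the target.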



-- 
Bernd Schubert
DataDirect Networks
