Experts,

I am trying to use Linux heartbeat (2.1.4 with v1-style resource configuration) with LVM to mount Lustre MDTs. My configuration is simple; the ha.cf and haresources files are attached. I have an interesting observation. When I reboot the MDS nodes and start the MDTs with "service heartbeat start" simultaneously on both MDS nodes, sometimes I get the following messages:

mds1: 2009/12/10_13:48:08 CRITICAL: Resource LVM::mgsvg is active, and should not be!
mds1: 2009/12/10_13:48:08 CRITICAL: Non-idle resources can affect data integrity!

mds2: 2009/12/10_13:47:08 CRITICAL: Resource LVM::mgsvg is active, and should not be!
mds2: 2009/12/10_13:47:08 CRITICAL: Non-idle resources can affect data integrity!

and heartbeat on both MDS nodes does not start any resource (even after waiting for 35 minutes). Has anyone seen this before?

/etc/ha.d/ha.cf :
===========
use_logd on
logfile /var/log/ha-log
debugfile /var/log/ha-debug
logfacility local0
keepalive 2
deadtime 120
warntime 10
initdead 120
udpport 694
mcast eth0 239.0.0.3 694 1 0
mcast ib0 224.0.0.3 694 1 0
node mds1
node mds2
auto_failback off
stonith_host mds1 external/ipmi mds2 mds2-sp root changeme lanplus
stonith_host mds2 external/ipmi mds1 mds1-sp root changeme lanplus

/etc/ha.d/haresources :
================
mds1 LVM::mgsvg Filesystem::/dev/mgsvg/mgs::/lustre/mgs::lustre
mds1 LVM::home1vg Filesystem::/dev/home1vg/home1::/lustre/home1::lustre
mds1 LVM::data1vg Filesystem::/dev/data1vg/data1::/lustre/data1::lustre
mds2 LVM::flushvg Filesystem::/dev/flushvg/flush::/lustre/flush::lustre
mds2 LVM::data2vg Filesystem::/dev/data2vg/data2::/lustre/data2::lustre
mds2 LVM::home2vg Filesystem::/dev/home2vg/home2::/lustre/home2::lustre

Cheers,
_Atul
I'm hoping an actual expert will respond to your question and educate us all, but in the meantime I had a thought: is device-mapper starting your volume groups on both systems, and is the LVM resource agent script then correctly noticing this? Maybe you need to configure LVM not to start automatically and instead wait for the resource agent script to do it.

Jim

On Fri, Dec 11, 2009 at 02:01:24AM +1100, Atul Vidwansa wrote:
> I am trying to use Linux heartbeat (2.1.4 with v1-style resource
> configuration) with LVM to mount Lustre MDTs. [...] When I reboot
> the MDS nodes and start the MDTs with "service heartbeat start"
> simultaneously on both mds nodes, sometimes I get the following messages:
>
> mds1: 2009/12/10_13:48:08 CRITICAL: Resource LVM::mgsvg is active, and
> should not be!
> mds1: 2009/12/10_13:48:08 CRITICAL: Non-idle resources can affect data
> integrity!
> [...]
> and heartbeat on both mds nodes does not start any resource (even after
> waiting for 35 minutes).
> Has anyone seen this before?
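P.S. If it does turn out to be the distribution's boot-time "vgchange -a y"
that activates the shared volume groups, one possible workaround (just a
sketch I have not tested against your setup; the VG names are the ones from
your haresources) is to deactivate them again before heartbeat is started,
e.g. from a small init script ordered ahead of heartbeat:

    #!/bin/sh
    # Deactivate the shared VGs that the boot scripts may have activated,
    # so the heartbeat LVM resource agent finds them inactive and can
    # take ownership cleanly.
    for vg in mgsvg home1vg data1vg flushvg data2vg home2vg; do
        # "-a n" deactivates every LV in the VG; ignore errors for VGs
        # that are not visible from this node.
        vgchange -a n "$vg" >/dev/null 2>&1 || true
    done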
On Fri, 2009-12-11 at 02:01 +1100, Atul Vidwansa wrote:
>
> When I reboot the MDS nodes and start the MDTs with "service
> heartbeat start" simultaneously on both mds nodes, sometimes I get
> the following messages:

With both nodes up and running at the same time, likely they have both
done a vgscan; vgchange -a y on the shared disk(s). I don't know that
this is in itself a problem. I do the same thing here and I have not
(yet) seen any ill effects. I am far from an LVM expert however.

> mds1: 2009/12/10_13:48:08 CRITICAL: Resource LVM::mgsvg is active, and
> should not be!
> mds1: 2009/12/10_13:48:08 CRITICAL: Non-idle resources can affect data
> integrity!

I wonder how it's determining that LVM::mgsvg is "active" and what it
considers "active". A look into the source for that would most likely be
very fruitful. And it was.

It seems that "/usr/lib/ocf/resource.d/heartbeat/LVM status" is what is
used to determine who owns the resource. The LVM resource script does
that with a:

vgdisplay [-v if lvm version is >= 2] $volume 2>&1 | grep -i 'Status[ \t]*available'

What is interesting is that on my LVM 2 system, vgdisplay with -v also
shows a:

LV Status    available

for every volume in the VG. I wonder if they are just not accounting for
that. Or maybe that's what they are looking for, given that on my active
and in-use LVM system here, for the VG itself, Status shows:

VG Status    resizable

So they can't be looking for an "available" in the VG Status for
"resource ownership" and must want the LV Status line(s).

Looking a little further, the LVM script has both "start" and "stop"
actions which presumably heartbeat invokes to (dis-)"own" a resource.
These two actions do:

vgscan; vgchange -a y $1

and

vgchange -a n $1

respectively. That implies that heartbeat wants to own an entire VG or
nothing. It would appear you cannot have multiple volumes from a single
VG owned by different nodes. As I said, I do this myself and have found
no issues, but I am not at all a heavy, or what I would call
"production", user.

> and heartbeat on both mds nodes does not start any resource (even after
> waiting for 35 minutes).

Well, it would seem that heartbeat has found a condition it considers
dangerous and is stopping there so as not to cause any damage. From the
looks of things, you will need to disable the operating system's LVM
startup code and leave it to heartbeat to manage, if you buy into their
assumptions. Might be worth a question or two on the LVM list to see if
the assumptions are valid or not -- or resign yourself to allowing
heartbeat to operate LVM resource ownership at the VG level and not the
LV level.

Cheers,
b.
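P.S. If anyone wants to see for themselves what heartbeat will conclude on
their own nodes, the status check boils down to roughly the following
(paraphrased from the script rather than copied verbatim, so treat it as a
sketch):

    # Approximation of the heartbeat LVM agent's "status" test for one VG.
    VG=mgsvg    # whichever VG heartbeat is complaining about
    if vgdisplay -v "$VG" 2>&1 | grep -qi 'Status[ \t]*available'; then
        echo "$VG looks active -- heartbeat will refuse to take it over"
    else
        echo "$VG looks inactive -- heartbeat should be willing to start it"
    fi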
On Thu, Dec 10, 2009 at 03:29:30PM -0500, Brian J. Murrell wrote:
> [...]
> Well, it would seem that heartbeat has found a condition it considers
> dangerous and is stopping there so as not to cause any damage. From the
> looks of things, you will need to disable the operating system's LVM
> startup code and leave it to heartbeat to manage, if you buy into their
> assumptions. Might be worth a question or two on the LVM list to see if
> the assumptions are valid or not -- or resign yourself to allowing
> heartbeat to operate LVM resource ownership at the VG level and not the
> LV level.

I suppose that using the LVM resource script implies that heartbeat owns
the resource and must start and stop it. If that isn't required, then one
could always manage the Lustre server resource as we do when there is a
"real" shared block device that's expected to appear on both nodes. I bet
the LVM admin commands aren't safe for that, though.
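For what it's worth, going that route with a v1-style setup would just mean
dropping the LVM:: entries and letting heartbeat manage only the mounts,
something like (a sketch only, with the caveat above that the VGs would then
be active on both nodes outside heartbeat's control):

    mds1 Filesystem::/dev/mgsvg/mgs::/lustre/mgs::lustre
    mds2 Filesystem::/dev/flushvg/flush::/lustre/flush::lustre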
On 2009-12-10, at 13:29, Brian J. Murrell wrote:
> Looking a little further, the LVM script has both "start" and "stop"
> actions which presumably heartbeat invokes to (dis-)"own" a resource.
> These two actions do:
>
> vgscan; vgchange -a y $1
>
> and
>
> vgchange -a n $1
>
> respectively. That implies that heartbeat wants to own an entire VG or
> nothing. It would appear you cannot have multiple volumes from a single
> VG owned by different nodes. As I said, I do this myself and have found
> no issues, but I am not at all a heavy, or what I would call
> "production", user.

A VG is like a filesystem in that regard, even though the layout changes
much less frequently. If two nodes had a VG imported, and then one did a
grow of an LV (let's say a raw volume for simplicity), the allocation of
that volume would consume PEs from the VG, which changes the layout on
disk. The node that did the resize would reflect the new size, but the
other node has no reason to re-read the VG layout from disk and would see
the old size and PE allocation maps. If it then resized a different LV, it
would lead to corruption of the VG.

>> and heartbeat on both mds nodes does not start any resource (even
>> after waiting for 35 minutes).
>
> Well, it would seem that heartbeat has found a condition it considers
> dangerous and is stopping there so as not to cause any damage. From the
> looks of things, you will need to disable the operating system's LVM
> startup code and leave it to heartbeat to manage, if you buy into their
> assumptions. Might be worth a question or two on the LVM list to see if
> the assumptions are valid or not -- or resign yourself to allowing
> heartbeat to operate LVM resource ownership at the VG level and not the
> LV level.

No, the heartbeat code is correct. The whole VG should be under control of
the HA agent, unless you are using the clustered LVM extensions that Red
Hat wrote for GFS2. I'm not sure if they are public or not, but in any
case, since Lustre/ldiskfs expects sole ownership of the LVs (and the
filesystems therein) there isn't any benefit to having them imported on 2
nodes at once, but a lot of risk.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
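P.S. To make the failure mode concrete, this is the kind of sequence being
described (an illustration only, with made-up VG/LV names; do not run it
against a real shared VG, since it is exactly what corrupts the metadata):

    # node1 and node2 both have "sharedvg" activated
    node1# lvextend -l +100 sharedvg/lv1   # node1 allocates new PEs and
                                           # writes updated VG metadata
    node2# lvextend -l +100 sharedvg/lv2   # node2 still holds the old
                                           # layout, so it can hand out the
                                           # same PEs and clobber node1's
                                           # metadata update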
On Thu, 2009-12-10 at 13:52 -0700, Andreas Dilger wrote:
>
> A VG is like a filesystem in that regard, even though the layout changes
> much less frequently. If two nodes had a VG imported, and then one did a
> grow of an LV (let's say a raw volume for simplicity), the allocation of
> that volume would consume PEs from the VG, which changes the layout on
> disk. The node that did the resize would reflect the new size, but the
> other node has no reason to re-read the VG layout from disk and would
> see the old size and PE allocation maps. If it then resized a different
> LV, it would lead to corruption of the VG.

Yeah. As an afterthought I had wondered about that. I only recall ever
going through resize operations with my shared-VG implementation here
once, and even then I did it all from one node and more than likely did a
vgscan on the other node when I was done.

> No, the heartbeat code is correct.

Agreed.

> The whole VG should be under
> control of the HA agent, unless you are using the clustered LVM
> extensions that Red Hat wrote for GFS2.

Yeah. clvmd, it seems.

> since Lustre/ldiskfs expects sole
> ownership of the LVs

Individual LVs, though.

> (and the filesystems therein) there isn't any
> benefit to having them imported on 2 nodes at once, but a lot of risk.

Cluster-aware LVM aside, there is if you had one volume group with all of
the OSTs for two OSSes in it. Of course, you could isolate your VGs on a
per-OSS basis, but then you start to lose the flexibility LVM brings to
the table and also still limit yourself from migrating individual OSTs
between OSSes.

FWIW, there does seem to be some indication from google that clvmd is
FOSS. There is a debian bug report about an initscript for their clvmd
package. There is a clvm package in Ubuntu as well.

b.
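P.S. To put that in haresources terms: because the heartbeat LVM agent
activates and deactivates a whole VG, a single VG holding every OST could
only ever be assigned to one OSS at a time, e.g. (hypothetical names,
sketch only):

    oss1 LVM::ostvg Filesystem::/dev/ostvg/ost0::/lustre/ost0::lustre Filesystem::/dev/ostvg/ost1::/lustre/ost1::lustre

whereas per-OSS VGs let each OSS own its own group but give up the ability
to fail over or migrate individual OSTs independently.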
On Thu, 2009-12-10 at 12:46 -0800, Jim Garlick wrote:
>
> I suppose that using the LVM resource script implies that heartbeat owns
> the resource and must start and stop it.

Indeed. That is my understanding too, and further, if heartbeat finds a
resource already running on a node on which it's trying to start it, it
stops and throws its hands up. When the O/S is starting LVM, both nodes
end up doing that.

I suppose ultimately this is not unlike not having the Lustre devices in
/etc/fstab (set to automount, anyway) when using heartbeat, because you
know heartbeat is going to do the mount.

b.
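P.S. If you do want the Lustre targets listed in /etc/fstab for reference,
the equivalent trick is to mark them noauto so nothing but heartbeat ever
mounts them; a sketch using the MGS device from Atul's haresources:

    /dev/mgsvg/mgs   /lustre/mgs   lustre   noauto   0 0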
Brian J. Murrell wrote:
> Indeed. That is my understanding too, and further, if heartbeat finds a
> resource already running on a node on which it's trying to start it, it
> stops and throws its hands up. When the O/S is starting LVM, both nodes
> end up doing that.

One of many limitations in heartbeat v1 clusters. Pacemaker and, IIRC,
the heartbeat2 CRM will attempt to stop an overactive resource when they
notice it is active somewhere it shouldn't be. Also, if a stop request
fails to clear up the confusion and ensure it is running on just one
node: STONITH. V1 falls short in this case, too. hb2/pacemaker HA
clusters do more to avoid throwing their hands up and accepting no
availability. All the more reason why anyone going HA in production
should brave the learning curve and update to something current.

--
: Adam Gandelman
: LINBIT | Your Way to High Availability
: Telephone: 503-573-1262 ext. 203
: Sales: 1-877-4-LINBIT / 1-877-454-6248
:
: 7959 SW Cirrus Dr.
: Beaverton, OR 97008
:
: http://www.linbit.com
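P.S. For comparison, the equivalent of one of the haresources pairs in a
pacemaker (crm shell) configuration would look roughly like the following.
This is a sketch from memory -- check the parameter names against the
resource agent metadata ("crm ra info ocf:heartbeat:LVM") before relying
on it:

    primitive p_lvm_mgsvg ocf:heartbeat:LVM \
            params volgrpname="mgsvg" \
            op monitor interval="30s" timeout="30s"
    primitive p_fs_mgs ocf:heartbeat:Filesystem \
            params device="/dev/mgsvg/mgs" directory="/lustre/mgs" fstype="lustre" \
            op monitor interval="60s" timeout="60s"
    group g_mgs p_lvm_mgsvg p_fs_mgs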
On Thu, Dec 10, 2009 at 02:35:38PM -0800, Adam Gandelman wrote:
> One of many limitations in heartbeat v1 clusters. Pacemaker and, IIRC,
> the heartbeat2 CRM will attempt to stop an overactive resource when they
> notice it is active somewhere it shouldn't be. Also, if a stop request
> fails to clear up the confusion and ensure it is running on just one
> node: STONITH. V1 falls short in this case, too. hb2/pacemaker HA
> clusters do more to avoid throwing their hands up and accepting no
> availability. All the more reason why anyone going HA in production
> should brave the learning curve and update to something current.

Here we have a configuration problem, and the right thing is probably to
throw up hands and make somebody fix it. It could be dangerous to have LVM
start on both nodes, run for a while exposed to races, then have heartbeat
shut down one side and "just work". But I see your point.