Experts,

I am trying to use Linux heartbeat (2.1.4 with v1-style resource configuration) with LVM to mount Lustre MDTs. My configuration is simple; the ha.cf and haresources files are attached. I have an interesting observation. When I reboot the MDS nodes and start the MDTs with "service heartbeat start" simultaneously on both MDS nodes, sometimes I get the following messages:

mds1: 2009/12/10_13:48:08 CRITICAL: Resource LVM::mgsvg is active, and should not be!
mds1: 2009/12/10_13:48:08 CRITICAL: Non-idle resources can affect data integrity!

mds2: 2009/12/10_13:47:08 CRITICAL: Resource LVM::mgsvg is active, and should not be!
mds2: 2009/12/10_13:47:08 CRITICAL: Non-idle resources can affect data integrity!

and heartbeat on both MDS nodes does not start any resource (even after waiting for 35 minutes). Has anyone seen this before?

/etc/ha.d/ha.cf :
===========
use_logd on
logfile /var/log/ha-log
debugfile /var/log/ha-debug
logfacility local0
keepalive 2
deadtime 120
warntime 10
initdead 120
udpport 694
mcast eth0 239.0.0.3 694 1 0
mcast ib0 224.0.0.3 694 1 0
node mds1
node mds2
auto_failback off
stonith_host mds1 external/ipmi mds2 mds2-sp root changeme lanplus
stonith_host mds2 external/ipmi mds1 mds1-sp root changeme lanplus

/etc/ha.d/haresources :
================
mds1 LVM::mgsvg Filesystem::/dev/mgsvg/mgs::/lustre/mgs::lustre
mds1 LVM::home1vg Filesystem::/dev/home1vg/home1::/lustre/home1::lustre
mds1 LVM::data1vg Filesystem::/dev/data1vg/data1::/lustre/data1::lustre
mds2 LVM::flushvg Filesystem::/dev/flushvg/flush::/lustre/flush::lustre
mds2 LVM::data2vg Filesystem::/dev/data2vg/data2::/lustre/data2::lustre
mds2 LVM::home2vg Filesystem::/dev/home2vg/home2::/lustre/home2::lustre

Cheers,
_Atul
I'm hoping an actual expert will respond to your question and educate us all, but in the meantime I had a thought: is device-mapper starting your volume groups on both systems, and is the LVM resource agent script then correctly noticing this? Maybe you need to configure LVM not to start automatically and instead wait for the resource agent script to do it.

Jim

On Fri, Dec 11, 2009 at 02:01:24AM +1100, Atul Vidwansa wrote:
> I am trying to use Linux heartbeat (2.1.4 with v1-style resource
> configuration) with LVM to mount Lustre MDTs. [...] When I reboot
> the MDS nodes and start the MDTs with "service heartbeat start"
> simultaneously on both mds nodes, sometimes I get the following messages:
>
> mds1: 2009/12/10_13:48:08 CRITICAL: Resource LVM::mgsvg is active, and
> should not be!
> mds1: 2009/12/10_13:48:08 CRITICAL: Non-idle resources can affect data
> integrity!
> [...]
> and heartbeat on both mds nodes does not start any resource (even after
> waiting for 35 minutes).
> Has anyone seen this before?
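P.S. If it does turn out to be the distribution's boot-time "vgchange -a y"
that activates the shared volume groups, one possible workaround (just a
sketch I have not tested against your setup; the VG names are the ones from
your haresources) is to deactivate them again before heartbeat is started,
e.g. from a small init script ordered ahead of heartbeat:

    #!/bin/sh
    # Deactivate the shared VGs that the boot scripts may have activated,
    # so the heartbeat LVM resource agent finds them inactive and can
    # take ownership cleanly.
    for vg in mgsvg home1vg data1vg flushvg data2vg home2vg; do
        # "-a n" deactivates every LV in the VG; ignore errors for VGs
        # that are not visible from this node.
        vgchange -a n "$vg" >/dev/null 2>&1 || true
    done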
On Fri, 2009-12-11 at 02:01 +1100, Atul Vidwansa wrote:
>
> When I reboot the MDS nodes and start the MDTs with "service
> heartbeat start" simultaneously on both mds nodes, sometimes I get
> the following messages:

With both nodes up and running at the same time, likely they have both
done a vgscan; vgchange -a y on the shared disk(s). I don't know that
this is in itself a problem. I do the same thing here and I have not
(yet) seen any ill effects. I am far from an LVM expert however.

> mds1: 2009/12/10_13:48:08 CRITICAL: Resource LVM::mgsvg is active, and
> should not be!
> mds1: 2009/12/10_13:48:08 CRITICAL: Non-idle resources can affect data
> integrity!

I wonder how it's determining that LVM::mgsvg is "active" and what it
considers "active". A look into the source for that would most likely be
very fruitful. And it was.

It seems that "/usr/lib/ocf/resource.d/heartbeat/LVM status" is what is
used to determine who owns the resource. The LVM resource script does
that with a:

vgdisplay [-v if lvm version is >= 2] $volume 2>&1 | grep -i 'Status[ \t]*available'

What is interesting is that on my LVM 2 system, vgdisplay with -v also
shows a:

LV Status    available

for every volume in the VG. I wonder if they are just not accounting for
that. Or maybe that's what they are looking for, given that on my active
and in-use LVM system here, for the VG itself, Status shows:

VG Status    resizable

So they can't be looking for an "available" in the VG Status for
"resource ownership" and must want the LV Status line(s).

Looking a little further, the LVM script has both "start" and "stop"
actions which presumably heartbeat invokes to (dis-)"own" a resource.
These two actions do:

vgscan; vgchange -a y $1

and

vgchange -a n $1

respectively. That implies that heartbeat wants to own an entire VG or
nothing. It would appear you cannot have multiple volumes from a single
VG owned by different nodes. As I said, I do this myself and have found
no issues, but I am not at all a heavy, or what I would call
"production", user.

> and heartbeat on both mds nodes does not start any resource (even after
> waiting for 35 minutes).

Well, it would seem that heartbeat has found a condition it considers
dangerous and is stopping there so as not to cause any damage. From the
looks of things, you will need to disable the operating system's LVM
startup code and leave it to heartbeat to manage, if you buy into their
assumptions. Might be worth a question or two on the LVM list to see if
the assumptions are valid or not -- or resign yourself to allowing
heartbeat to operate LVM resource ownership at the VG level and not the
LV level.

Cheers,
b.
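P.S. If anyone wants to see for themselves what heartbeat will conclude on
their own nodes, the status check boils down to roughly the following
(paraphrased from the script rather than copied verbatim, so treat it as a
sketch):

    # Approximation of the heartbeat LVM agent's "status" test for one VG.
    VG=mgsvg    # whichever VG heartbeat is complaining about
    if vgdisplay -v "$VG" 2>&1 | grep -qi 'Status[ \t]*available'; then
        echo "$VG looks active -- heartbeat will refuse to take it over"
    else
        echo "$VG looks inactive -- heartbeat should be willing to start it"
    fi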
On Thu, Dec 10, 2009 at 03:29:30PM -0500, Brian J. Murrell wrote:
> [...]
> Well, it would seem that heartbeat has found a condition it considers
> dangerous and is stopping there so as not to cause any damage. From the
> looks of things, you will need to disable the operating system's LVM
> startup code and leave it to heartbeat to manage, if you buy into their
> assumptions. Might be worth a question or two on the LVM list to see if
> the assumptions are valid or not -- or resign yourself to allowing
> heartbeat to operate LVM resource ownership at the VG level and not the
> LV level.

I suppose that using the LVM resource script implies that heartbeat owns
the resource and must start and stop it. If that isn't required, then one
could always manage the Lustre server resource as we do when there is a
"real" shared block device that's expected to appear on both nodes. I bet
the LVM admin commands aren't safe for that, though.
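For what it's worth, going that route with a v1-style setup would just mean
dropping the LVM:: entries and letting heartbeat manage only the mounts,
something like (a sketch only, with the caveat above that the VGs would then
be active on both nodes outside heartbeat's control):

    mds1 Filesystem::/dev/mgsvg/mgs::/lustre/mgs::lustre
    mds2 Filesystem::/dev/flushvg/flush::/lustre/flush::lustre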
On 2009-12-10, at 13:29, Brian J. Murrell wrote:
> Looking a little further, the LVM script has both "start" and "stop"
> actions which presumably heartbeat invokes to (dis-)"own" a resource.
> These two actions do:
>
> vgscan; vgchange -a y $1
>
> and
>
> vgchange -a n $1
>
> respectively. That implies that heartbeat wants to own an entire VG or
> nothing. It would appear you cannot have multiple volumes from a single
> VG owned by different nodes. As I said, I do this myself and have found
> no issues, but I am not at all a heavy, or what I would call
> "production", user.

A VG is like a filesystem in that regard, even though the layout changes
much less frequently. If two nodes had a VG imported, and then one did a
grow of an LV (let's say a raw volume for simplicity), the allocation of
that volume would consume PEs from the VG, which changes the layout on
disk. The node that did the resize would reflect the new size, but the
other node has no reason to re-read the VG layout from disk and would see
the old size and PE allocation maps. If it then resized a different LV, it
would lead to corruption of the VG.

>> and heartbeat on both mds nodes does not start any resource (even
>> after waiting for 35 minutes).
>
> Well, it would seem that heartbeat has found a condition it considers
> dangerous and is stopping there so as not to cause any damage. From the
> looks of things, you will need to disable the operating system's LVM
> startup code and leave it to heartbeat to manage, if you buy into their
> assumptions. Might be worth a question or two on the LVM list to see if
> the assumptions are valid or not -- or resign yourself to allowing
> heartbeat to operate LVM resource ownership at the VG level and not the
> LV level.

No, the heartbeat code is correct. The whole VG should be under control of
the HA agent, unless you are using the clustered LVM extensions that Red
Hat wrote for GFS2. I'm not sure if they are public or not, but in any
case, since Lustre/ldiskfs expects sole ownership of the LVs (and the
filesystems therein) there isn't any benefit to having them imported on 2
nodes at once, but a lot of risk.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
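P.S. To make the failure mode concrete, this is the kind of sequence being
described (an illustration only, with made-up VG/LV names; do not run it
against a real shared VG, since it is exactly what corrupts the metadata):

    # node1 and node2 both have "sharedvg" activated
    node1# lvextend -l +100 sharedvg/lv1   # node1 allocates new PEs and
                                           # writes updated VG metadata
    node2# lvextend -l +100 sharedvg/lv2   # node2 still holds the old
                                           # layout, so it can hand out the
                                           # same PEs and clobber node1's
                                           # metadata update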
On Thu, 2009-12-10 at 13:52 -0700, Andreas Dilger wrote:
>
> A VG is like a filesystem in that regard, even though the layout changes
> much less frequently. If two nodes had a VG imported, and then one did a
> grow of an LV (let's say a raw volume for simplicity), the allocation of
> that volume would consume PEs from the VG, which changes the layout on
> disk. The node that did the resize would reflect the new size, but the
> other node has no reason to re-read the VG layout from disk and would
> see the old size and PE allocation maps. If it then resized a different
> LV, it would lead to corruption of the VG.

Yeah. As an afterthought I had wondered about that. I only recall ever
going through resize operations with my shared-VG implementation here
once, and even then I did it all from one node and more than likely did a
vgscan on the other node when I was done.

> No, the heartbeat code is correct.

Agreed.

> The whole VG should be under
> control of the HA agent, unless you are using the clustered LVM
> extensions that Red Hat wrote for GFS2.

Yeah. clvmd, it seems.

> since Lustre/ldiskfs expects sole
> ownership of the LVs

Individual LVs, though.

> (and the filesystems therein) there isn't any
> benefit to having them imported on 2 nodes at once, but a lot of risk.

Cluster-aware LVM aside, there is if you had one volume group with all of
the OSTs for two OSSes in it. Of course, you could isolate your VGs on a
per-OSS basis, but then you start to lose the flexibility LVM brings to
the table and also still limit yourself from migrating individual OSTs
between OSSes.

FWIW, there does seem to be some indication from google that clvmd is
FOSS. There is a debian bug report about an initscript for their clvmd
package. There is a clvm package in Ubuntu as well.

b.
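P.S. To put that in haresources terms: because the heartbeat LVM agent
activates and deactivates a whole VG, a single VG holding every OST could
only ever be assigned to one OSS at a time, e.g. (hypothetical names,
sketch only):

    oss1 LVM::ostvg Filesystem::/dev/ostvg/ost0::/lustre/ost0::lustre Filesystem::/dev/ostvg/ost1::/lustre/ost1::lustre

whereas per-OSS VGs let each OSS own its own group but give up the ability
to fail over or migrate individual OSTs independently.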
On Thu, 2009-12-10 at 12:46 -0800, Jim Garlick wrote:
>
> I suppose that using the LVM resource script implies that heartbeat owns
> the resource and must start and stop it.

Indeed. That is my understanding too, and further, if heartbeat finds a
resource already running on a node on which it's trying to start it, it
stops and throws its hands up. When the O/S is starting LVM, both nodes
end up doing that.

I suppose ultimately this is not unlike not having the Lustre devices in
/etc/fstab (set to automount, anyway) when using heartbeat, because you
know heartbeat is going to do the mount.

b.
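P.S. If you do want the Lustre targets listed in /etc/fstab for reference,
the equivalent trick is to mark them noauto so nothing but heartbeat ever
mounts them; a sketch using the MGS device from Atul's haresources:

    /dev/mgsvg/mgs   /lustre/mgs   lustre   noauto   0 0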
Brian J. Murrell wrote:
> Indeed. That is my understanding too, and further, if heartbeat finds a
> resource already running on a node on which it's trying to start it, it
> stops and throws its hands up. When the O/S is starting LVM, both nodes
> end up doing that.

One of many limitations in heartbeat v1 clusters. Pacemaker and, IIRC,
the heartbeat2 CRM will attempt to stop an overactive resource when they
notice it is active somewhere it shouldn't be. Also, if a stop request
fails to clear up the confusion and ensure it is running on just one
node: STONITH. V1 falls short in this case, too. hb2/pacemaker HA
clusters do more to avoid throwing their hands up and accepting no
availability. All the more reason why anyone going HA in production
should brave the learning curve and update to something current.

--
: Adam Gandelman
: LINBIT | Your Way to High Availability
: Telephone: 503-573-1262 ext. 203
: Sales: 1-877-4-LINBIT / 1-877-454-6248
:
: 7959 SW Cirrus Dr.
: Beaverton, OR 97008
:
: http://www.linbit.com
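P.S. For comparison, the equivalent of one of the haresources pairs in a
pacemaker (crm shell) configuration would look roughly like the following.
This is a sketch from memory -- check the parameter names against the
resource agent metadata ("crm ra info ocf:heartbeat:LVM") before relying
on it:

    primitive p_lvm_mgsvg ocf:heartbeat:LVM \
            params volgrpname="mgsvg" \
            op monitor interval="30s" timeout="30s"
    primitive p_fs_mgs ocf:heartbeat:Filesystem \
            params device="/dev/mgsvg/mgs" directory="/lustre/mgs" fstype="lustre" \
            op monitor interval="60s" timeout="60s"
    group g_mgs p_lvm_mgsvg p_fs_mgs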
On Thu, Dec 10, 2009 at 02:35:38PM -0800, Adam Gandelman wrote:
> One of many limitations in heartbeat v1 clusters. Pacemaker and, IIRC,
> the heartbeat2 CRM will attempt to stop an overactive resource when they
> notice it is active somewhere it shouldn't be. Also, if a stop request
> fails to clear up the confusion and ensure it is running on just one
> node: STONITH. V1 falls short in this case, too. hb2/pacemaker HA
> clusters do more to avoid throwing their hands up and accepting no
> availability. All the more reason why anyone going HA in production
> should brave the learning curve and update to something current.

Here we have a configuration problem, and the right thing is probably to
throw up hands and make somebody fix it. It could be dangerous to have LVM
start on both nodes, run for a while exposed to races, then have heartbeat
shut down one side and "just work". But I see your point.