I'm experimenting with setting up a SAN and live migrations. I'm running Debian squeeze and usually create domains with xen-tools, although I will consider other possibilities. I saw a tutorial on iSCSI + LVM, set up an iSCSI target for use as a volume group, and made it into a volume group; as far as that goes, it seems to work fine.

What are my possibilities for destruction :-) I'd like to test the limits now, while nothing that really matters is on the VG. If I create a new logical volume on one node, it shows up on the other node without any issues.

Should I be using clvm? I tried to install it, but it seems to want a setup in /etc/cluster. Or should I do the sharing in a different way?

John
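For reference, the usual command sequence for that kind of setup looks roughly like this (hypothetical portal address, device and volume names; on squeeze the initiator side is open-iscsi):

  # discover the target and log in (run on every node)
  iscsiadm -m discovery -t sendtargets -p 192.168.1.10
  iscsiadm -m node --login

  # turn the new LUN into a PV and a VG (run on ONE node only)
  pvcreate /dev/sdb
  vgcreate vg_san /dev/sdb

  # carve out a volume for a domU
  lvcreate -L 10G -n vm1-disk vg_san

  # on the other node, reread the metadata so the new LV shows up
  lvscan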
On Fri, Jun 29, 2012 at 10:32 AM, John McMonagle <johnm@advocap.org> wrote:
> Should I be using clvm?

Not necessarily, but it does have some advantages.

In most VM setups you don't use cluster filesystems (GFS, OCFS2, etc.), so you should never mount the same volume on two machines. That extends to virtual machines too: you must not start the same VM image on two hosts. Most setups help you enforce this, and live migration makes sure the target VM isn't resumed until the original VM is no longer running.

So far, no need for any 'cluster' setup.

LVM reads all the physical volume, volume group and logical volume layouts (what it calls metadata) from the shared storage into RAM at startup and then works from there. The metadata is only rewritten to disk when it's modified (creating/deleting volumes, resizing them, adding PVs, etc.).

That means that if you start more than one box connected to the same PVs, they'll all be able to reach the same VGs and LVs, and as long as you don't modify anything, it will run perfectly. But if you want to make any metadata change, you have to:

1) choose one machine to do the change
2) on every other machine, deactivate the VG (vgchange -a n)
3) do any needed change on the only machine that still has the VG active
4) reread the metadata on all machines (lvscan)

If you have periodic planned downtimes, you can schedule things and work like this; but if you can't afford the few minutes it takes, you need clvm.

What clvm does is use the 'suspend' feature of the device mapper to make sure no process on any machine performs any access to the shared storage until the metadata changes have been propagated. Roughly:

0) a clvmd daemon runs on every machine that has access to the VG; the daemons use a distributed lock manager to keep in touch
1) you run an LVM command that modifies metadata on any machine
2) the LVM command asks the local clvmd process to acquire a distributed lock
3) to get that lock, all the clvmd daemons issue a device-mapper suspend (dmsetup suspend); this doesn't 'freeze' the machine, it only blocks any I/O request to any LV in the VG
4) when all other machines are suspended, the original clvmd has acquired the lock and allows the LVM command to proceed
5) when the LVM command is finished, it asks clvmd to release the lock
6) to release the lock, the daemons on every other machine reread the LVM metadata (lvscan) and lift the suspend
7) when all the machines are resumed, the LVM command returns to the CLI prompt and everything is running again

As you can see, it's the same as the manual process, but since it all happens in a few milliseconds, the 'other' machines can just be suspended instead of having to be really brought down.

I guess it could also be done with a global script that spreads the 'suspend / wait / lvscan / resume' commands; but by the time you get it to run reliably, you've replicated the shared lock functionality.

-- 
Javier
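As a concrete sketch of the manual procedure above (hypothetical VG name vg_san; the reactivation in the last step is implied by the deactivation in step 2):

  # 2) on every node EXCEPT the one making the change
  vgchange -a n vg_san

  # 3) on the remaining node, make the metadata change, e.g.
  lvcreate -L 10G -n vm2-disk vg_san

  # 4) on every other node, reread the metadata and reactivate
  lvscan
  vgchange -a y vg_san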
On Friday 29 June 2012 3:29:47 pm Javier Guerra Giraldez wrote:
> On Fri, Jun 29, 2012 at 10:32 AM, John McMonagle <johnm@advocap.org> wrote:
> > Should I be using clvm?
>
> not necessarily; but it does have some advantages.
> [explanation of manual metadata changes vs. clvm snipped]

Thanks for the information.

I have been trying to get clvm working, but it's dependent on cman and I have not had much luck figuring out cluster.conf. I see the new version in testing has no dependency on cman, so I think I'll try that one. Looks like I will have to upgrade lvm2 also.

I have seen references saying that you cannot do snapshots, and others saying that the snapshot and the snapshotted volume have to be on one node. That I can live with.
Is that the case?

John
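In case it helps with the cluster.conf part: a minimal two-node cman configuration looks roughly like this (hypothetical cluster and node names; the node names must resolve to the hosts), and clvm also wants LVM's cluster locking turned on and the shared VG marked as clustered:

  # /etc/cluster/cluster.conf (identical on both nodes)
  <?xml version="1.0"?>
  <cluster name="xentest" config_version="1">
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="node1" nodeid="1"/>
      <clusternode name="node2" nodeid="2"/>
    </clusternodes>
  </cluster>

  # /etc/lvm/lvm.conf: switch to the clustered locking library
  locking_type = 3

  # mark the shared VG as clustered (once, with clvmd running)
  vgchange -c y vg_san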
On Sat, Jun 30, 2012 at 7:06 PM, John McMonagle <johnm@advocap.org> wrote:
> I have been trying to get clvm working, but it's dependent on cman and I
> have not had much luck figuring out cluster.conf.
>
> I see the new version in testing has no dependency on cman, so I think
> I'll try that one. Looks like I will have to upgrade lvm2 also.
>
> I have seen references saying that you cannot do snapshots, and others
> saying that the snapshot and the snapshotted volume have to be on one
> node. That I can live with. Is that the case?

Since you're using an iSCSI target anyway, why not just set up a separate LUN for each VM? That way you don't have to worry about LVM on the dom0 nodes.

-- 
Fajar
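If you go the LUN-per-VM route, the domU config can point at the iSCSI device directly; the stable /dev/disk/by-path names avoid depending on sdX probe order (hypothetical target address and IQN):

  # in the domU config file
  disk = ['phy:/dev/disk/by-path/ip-192.168.1.10:3260-iscsi-iqn.2012-06.org.example:vm1-lun-1,xvda,w']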
On Saturday, 30.06.2012, at 19:13 +0700, Fajar A. Nugraha wrote:
> Since you're using an iSCSI target anyway, why not just set up a
> separate LUN for each VM? That way you don't have to worry about LVM
> on the dom0 nodes.

I've tried that, but I was never able to figure out how a smoothly working multipath setup could be done for direct LUN/VM mapping. After some xm/xl create and migrate cycles, either dm-multipath or open-iscsi always had issues with locked devices, which in most cases couldn't be freed. I assume better scripting skills for my xen/scripts/block-* attach/detach scripts could have mitigated that.

Additionally, using a fixed set of LUNs, connected via a fixed number of paths over fixed dedicated links and combined as PVs into VGs, gave me the chance to tune every TCP and iSCSI value (target as well as initiator) to ideally fit the latency and bandwidth. Having different kinds of disks and RAID behind the LUNs gives me the possibility of offering different service tiers with guaranteed throughput and predictable latency to each dom0.

For future setups clvm will be my first choice again; only the lack of snapshots for clustered VGs is a pain. With the 1:1 LUN/VM mapping you suggest, it's up to your storage whether (and how) snapshots can be triggered, so if snapshots are required, one can't use clvm.

At least one could drop the "c" and use LVM without any locking mechanism, but when it comes to lvresize, lvcreate -s, etc., reloading the LVM layout on every dom0 becomes mandatory. I wouldn't recommend LVM without clustering, as it's extremely easy to get the layouts out of sync, which includes potential data loss (requires grande cojones...).

By now, I'm running different test scenarios with plain ocfs2 and tap:aio backends. Performance is not too bad, but I recently managed to I/O-deadlock my dom0s under high domU I/O demand. Tracking that down kept me from trying the lvm+ocfs2 combination and other weird scenarios ;)

My advice to John would be: learn the very basics of cman and friends (RHEL6 and derivatives offer very nice tutorials; as a starting point, look for luci and ricci) and try gfs, ocfs2 and/or clvm.

cheers,
Stephan
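For comparison, the plain ocfs2 + tap:aio variant Stephan describes keeps disk images as files on a cluster filesystem mounted on every dom0, e.g. (hypothetical device, mount point and image name; ocfs2 needs its o2cb cluster stack configured first):

  # mount the shared ocfs2 volume at the same path on every dom0
  mount -t ocfs2 /dev/sdc /mnt/guests

  # the domU config references the image file via blktap
  disk = ['tap:aio:/mnt/guests/vm1.img,xvda,w']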