Hi Folks,

I find myself trying to expand a 2-node high-availability cluster to a 4-node cluster. I'm running Xen virtualization, and currently using DRBD to mirror data and Pacemaker to fail over cleanly.

The thing is, I'm trying to add 2 nodes to the cluster, and DRBD doesn't scale. Also, because of rackspace limits and the hardware at hand, I can't separate storage nodes from compute nodes - instead, I have to live with 4 nodes, each with 4 large drives (but also with 4 GigE ports per server).

The obvious thought is to use Gluster to assemble all the drives into one large storage pool, with replication. But the last time I looked at this (6 months or so back), it looked like some of the critical features were brand new, and performance seemed to be a problem in the configuration I'm thinking of.

Which leads me to my question: has the situation improved to the point that I can use Gluster this way?

Thanks very much,

Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra
It would probably be better to ask this with end-goal questions instead of with an unspecified "critical feature" list and "performance problems".

6 months ago, for myself and quite an extensive (and often impressive) list of users, there were no missing critical features, nor were there any problems with performance. That's not to say that they necessarily meet your design specifications, but without those specs you're the only one who could evaluate that.

On 12/26/2012 08:24 PM, Miles Fidelman wrote:
> Which leads me to my question: has the situation improved to the
> point that I can use Gluster this way?
On 12-12-26 10:24 PM, Miles Fidelman wrote:
> Which leads me to my question: has the situation improved to the
> point that I can use Gluster this way?

Hi,

I have a XenServer pool (3 servers) talking to a GlusterFS replica server over NFS, with uCARP for IP failover. The system was put in place in May 2012, using GlusterFS 3.3. It ran very well, with speeds comparable to my existing iSCSI solution (http://majentis.com/2011/09/21/xenserver-iscsi-and-glusterfsnfs/).

I was quite pleased with the system; it worked flawlessly. Until November. At that point, the Gluster NFS server started stalling under load. It would become unresponsive for long enough that the VMs under XenServer would lose their drives. Linux would remount the drives read-only and then eventually lock up, while Windows would just lock up. In this case, Windows was more resilient to the transient disk loss.

I have been unable to solve the problem, and am now switching back to a DRBD/iSCSI setup. I'm not happy about it, but we were losing NFS connectivity nightly, during backups. Life was hell for a long time while I was trying to fix things.

Gerald
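For readers who haven't used it, the uCARP piece of a setup like Gerald's boils down to floating a virtual IP between the storage nodes, so the XenServer NFS storage repository always points at whichever node currently holds the address. A minimal sketch follows; the interface, addresses, password, and script paths are illustrative assumptions, not Gerald's actual configuration:

  # On each Gluster/NFS node (placeholder values):
  ucarp --interface=eth0 --srcip=192.168.1.11 --vhid=1 --pass=secret \
        --addr=192.168.1.10 \
        --upscript=/etc/ucarp/vip-up.sh --downscript=/etc/ucarp/vip-down.sh -B

  # /etc/ucarp/vip-up.sh adds the floating IP the clients mount from:
  #   ip addr add 192.168.1.10/24 dev eth0
  # /etc/ucarp/vip-down.sh removes it again:
  #   ip addr del 192.168.1.10/24 dev eth0

The NFS clients would then mount from 192.168.1.10 rather than from either node's own address.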
On Wed, Dec 26, 2012 at 11:24:25PM -0500, Miles Fidelman wrote:
> I find myself trying to expand a 2-node high-availability cluster
> to a 4-node cluster. I'm running Xen virtualization, and
> currently using DRBD to mirror data, and pacemaker to failover
> cleanly.

Not answering your question directly, but have you looked at Ganeti? This is a front-end to Xen+LVM+DRBD (open source, written by Google) which makes it easy to manage such a cluster, assuming DRBD is meeting your needs well at the moment.

With Ganeti, each VM image is its own logical volume, with its own DRBD instance sitting on top, so you can have different VMs mirrored between different pairs of machines. You can migrate storage, albeit slowly (e.g. starting with A mirrored to B, you can break the mirroring and then re-mirror A to C, and then mirror C to D). Ganeti automates all this for you.

Another option to look at is Sheepdog, which is a clustered block-storage device, but this would require you to switch from Xen to KVM.

> and performance seemed to be a
> problem in the configuration I'm thinking of.

With KVM at least, last time I tried, performance was still very poor when a VM image was being written to a file over gluster - I measured about 6 MB/s. However, remember that each VM can directly mount glusterfs volumes internally, and the performance of this is fine - and it also means you can share data between the VMs. So with some rearchitecting of your application you may get sufficient performance for your needs.

Regards,

Brian.
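As a rough illustration of the two things Brian describes (sketches only; the node names, OS variant, volume name, and mount point below are made-up assumptions, not commands taken from this thread):

  # Ganeti: create a DRBD-mirrored instance with node1 as primary and
  # node2 as secondary (Ganeti 2.x style syntax):
  gnt-instance add -t drbd -o debootstrap+default --disk 0:size=10G \
      -n node1:node2 vm1.example.com

  # Inside a guest: mount a GlusterFS volume directly with the native
  # client instead of keeping the data on the VM image:
  mount -t glusterfs storage1:/shared /srv/shared
  # or in /etc/fstab:
  # storage1:/shared  /srv/shared  glusterfs  defaults,_netdev  0 0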
Look, FUSE has its issues that we all know about. Either it works for you or it doesn't. If FUSE bothers you that much, look into libgfapi.

Re: NFS - I'm trying to help track this down. Please either add your comment to an existing bug or create a new ticket. Either way, ranting won't solve your problem or inspire anyone to fix it.

-JM

Stephan von Krawczynski <skraw at ithnet.com> wrote:

On Wed, 26 Dec 2012 22:04:09 -0800 Joe Julian <joe at julianfamily.org> wrote:
> 6 months ago, for myself and quite an extensive (and often impressive)
> list of users, there were no missing critical features, nor were there
> any problems with performance.

Well, then the list of users does obviously not contain me ;-) The damn thing will only become impressive if a native kernel client module is done. FUSE is really a pain. And read my lips: the NFS implementation has general load/performance problems. Don't be surprised if it jumps into your face. Why on earth do they think Linux has NFS as a kernel implementation?

--
Regards,
Stephan
I am hopeful that 3.4 will go much further in this regard. At this point, when anyone asks me about VM image management, I tell them it works for some and not for others. I've seen enough bad outcomes not to recommend it in all cases, but I've also seen enough good outcomes not to discount it out of hand either.

My answer now is the same as it has been: use at your own risk. But we've made much progress, and the recent qemu integration and libgfapi are a continuation of that. In general, I don't recommend any distributed filesystem for VM images, but I can also see that this is the wave of the future.

-JM

Miles Fidelman <mfidelman at meetinghouse.net> wrote:

Dan Cyr wrote:
> Miles - As is right now, GlusterFS is not what you want for backend VM
> storage.
>
> Question: "how well will this work?"
>
> Answer: "horribly"

Ok... that's the kind of answer I was looking for (though a disappointing one).

Thanks,

Miles

--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra
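For context on the qemu/libgfapi integration JM mentions: a libgfapi-enabled qemu (1.3 or later, built with GlusterFS support) can open a guest disk over gluster:// directly, bypassing FUSE. A sketch, with made-up server, volume, and image names:

  # Create an image directly on a Gluster volume:
  qemu-img create gluster://server1/vmimages/vm1.img 6G

  # Boot a guest from it:
  qemu-system-x86_64 -drive file=gluster://server1/vmimages/vm1.img,if=virtio ...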
Fidelman,

> Let's say that I take a slightly looser approach to high-availability:
> - keep the static parts of my installs on local disk
> - share and replicate dynamic data using gluster
> - failover by rebooting on a different node (no image to worry about
>   migrating)
>
> In this scenario, how well does gluster work when:
> - storage and processing are inter-mixed on the same nodes

Have you checked out GFS? If your hardware is IPMI capable, your configuration is a perfect candidate for GFS. It is actually also far more reliable. I have not used it in production, but I have set it up and played around with it. I'm also on their mailing list, and they have good words for it.

GlusterFS is far better for detached storage that needs scaling, is cheap (you don't need IPMI-capable servers), and is also redundant. I have been having high CPU utilization and slow writes on the client end, but the server side is very solid. I will keep testing and watching the mailing list going forward.

> - data is triply replicated (allow for 2-node failures)
>
> Miles Fidelman
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is. .... Yogi Berra

William
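For what it's worth, the IPMI requirement William mentions is about fencing: GFS2, like any shared-disk cluster filesystem, needs a way to forcibly power-cycle a misbehaving node. A quick sanity check that a node's IPMI interface can be driven for fencing might look like the following (the address and credentials are placeholders):

  # Query power state over the IPMI LAN interface:
  ipmitool -I lanplus -H 10.0.0.21 -U admin -P secret chassis power status

  # The same interface is what a cluster fence agent would use, e.g.:
  fence_ipmilan -a 10.0.0.21 -l admin -p secret -o status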
Joe,

> I have 3 servers with replica 3 volumes, 4 bricks per server on lvm
> partitions that are placed on each of 4 hard drives, 15 volumes
> resulting in 60 bricks per server. One of my servers is also a kvm host
> running (only) 24 vms.

Would you mind explaining your setup again? I could not quite follow it, probably because of terminology issues. For example, "4 bricks per server" - I don't understand this part; I assumed a brick == 1 physical server (okay, it could also be one VM, but I don't see how that would help unless it's a test environment). The way you put it, though, means I have issues with my terminology. Isn't there a 1:1 relationship between brick and server?

> Each vm image is only 6 gig, enough for the operating system and
> applications and is hosted on one volume. The data for each application
> is hosted on its own GlusterFS volume.

Hmm, pretty good idea, especially security-wise. It means one VM cannot mess with another VM's files. Is it possible to extend a Gluster volume without destroying and recreating it with a bigger peer storage setting?

> For mysql, I set up my innodb store to use 4 files (I don't do 1 file
> per table), each file distributes to each of the 4 replica subvolumes.
> This balances the load pretty nicely.

I thought lots of small files would be better than 4 huge files? I mean, why does this work out better performance-wise? I'm not saying it's wrong; I am just trying to learn from you, as I am looking for a similar setup. However, I could not think why using 4 files would be better, but that may be because I don't understand how GlusterFS works.

> I don't really do anything special for anything else, other than the php
> app recommendations I make on my blog (http://joejulian.name) which all
> have nothing to do with the actual filesystem.

Thanks for the link.

> The thing that I think some people (even John Mark) misapply is that
> this is just a tool. You have to engineer a solution using the tools you
> have available. If you feel the positives that GlusterFS provides
> outweigh the negatives, then you will simply have to engineer a solution
> that suits your end goal using this tool. It's not a question of whether
> it works, it's whether you can make it work for your use case.
>
> On 12/27/2012 03:00 PM, Miles Fidelman wrote:
>> Ok... now that's diametrically the opposite response from Dan Cyr's of
>> a few minutes ago.

William
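To make the MySQL point concrete: the "4 files" Joe describes are configured via innodb_data_file_path in my.cnf, so the shared InnoDB tablespace is split into several files that GlusterFS can then place on different bricks (with filenames chosen so the distribute hashing spreads them evenly). A sketch with illustrative names and sizes, not Joe's actual settings:

  # /etc/mysql/my.cnf (illustrative values)
  [mysqld]
  innodb_data_home_dir = /var/lib/mysql
  innodb_data_file_path = ibdata1:2G;ibdata2:2G;ibdata3:2G;ibdata4:2G:autoextend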
Thanks Joe,

>> Isn't there a 1:1 relationship between brick and server?
>
> In my configuration, 1 server has 4 drives (well, 5, but one's the OS).
> Each drive has one gpt partition. I create an lvm volume group that
> holds all four huge partitions. For any one GlusterFS volume I create 4
> lvm logical volumes:
>
> lvcreate -n a_vmimages clustervg /dev/sda1
> lvcreate -n b_vmimages clustervg /dev/sdb1
> lvcreate -n c_vmimages clustervg /dev/sdc1
> lvcreate -n d_vmimages clustervg /dev/sdd1
>
> then format them xfs and (I) mount them under
> /data/glusterfs/vmimages/{a,b,c,d}. These four lvm partitions are bricks
> for the new GlusterFS volume.

Followed. Actually, I'm going to redo it this way, but I will use a RAID instead of individual drives. Thanks.

> As glusterbot would say if asked for the glossary:
>> A "server" hosts "bricks" (ie. server1:/foo) which belong to a
>> "volume" which is accessed from a "client".

Yes, I checked the manual glossary and it's well explained. I had yet to read those last pages.

> My volume would then look like
>
> gluster volume create vmimages replica 3 \
>     server{1,2,3}:/data/glusterfs/vmimages/a/brick \
>     server{1,2,3}:/data/glusterfs/vmimages/b/brick \
>     server{1,2,3}:/data/glusterfs/vmimages/c/brick \
>     server{1,2,3}:/data/glusterfs/vmimages/d/brick

>>> Each vm image is only 6 gig, enough for the operating system and
>>> applications and is hosted on one volume. The data for each application
>>> is hosted on its own GlusterFS volume.
>>
>> Hmm, pretty good idea, especially security-wise. It means one VM cannot
>> mess with another VM's files. Is it possible to extend a Gluster volume
>> without destroying and recreating it with a bigger peer storage setting?
>
> I can do that two ways. I can add servers with storage and then
> add-brick to expand, or I can resize the lvm partitions and grow xfs
> (which I have done live several times).

Will be going with lvm, now that I understand what a brick is.

>>> For mysql, I set up my innodb store to use 4 files (I don't do 1 file
>>> per table), each file distributes to each of the 4 replica subvolumes.
>>> This balances the load pretty nicely.
>
> It's not so much a "how glusterfs works" question as much as it is a how
> innodb works question. By configuring the innodb_data_file_path to start
> with a multiple of your bricks (and carefully choosing some filenames to
> ensure they're distributed evenly), records seem to be (and I only have
> tested this through actual use and have no idea if this is how it's
> supposed to work) accessed evenly over the distribute set.

Hmm, have you checked on the gluster servers that these four files actually are on separate bricks? As far as I understand, if you have not done anything with the GlusterFS scheduler (default ALU on version 3.3), it is likely that is not what's happening. Or you are using a version that has a different scheduler. Interesting, though. Poke around and update us, please.

Thanks

William
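On the question of checking where those four files actually landed: one way, assuming the volume is fuse-mounted at /mnt/mysql and the files are named ibdata1..4 (both assumptions on my part), is to ask the client for the pathinfo xattr, which reports the backing brick for each file:

  getfattr -n trusted.glusterfs.pathinfo -e text /mnt/mysql/ibdata1
  getfattr -n trusted.glusterfs.pathinfo -e text /mnt/mysql/ibdata2

And for the "resize the lvm partitions and grow xfs" route Joe mentions, the live-growth step would look roughly like this (the volume group, logical volume, and mount point names follow his examples; the size is made up):

  lvextend -L +100G /dev/clustervg/a_vmimages
  xfs_growfs /data/glusterfs/vmimages/a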