Christopher Hawkins
2010-May-13 12:23 UTC
[Gluster-users] Best Practices for Gluster Replication
I have followed that debate before. The impression I got was that if you handle replication on the client side, the clients will be able to fail over from one server to the next if the original goes down (after the timeout). But if you handle it on the server side, then when a server goes down the only way to get HA for the clients is to externally implement round-robin DNS or something like that. Other than this issue, I think either way is technically acceptable. If memory serves, this is the reason why client-side is the "default" or preferred setup.

Chris

----- "James Burnash" <jburnash at knight.com> wrote:

> I know it's only been a day, and I understand that people are busy -
> but nobody has anything to share on this subject?
>
> It seems like it would be a good thing to be able to at least understand
> why implementing it on the back end would be a bad idea ...
>
> -----Original Message-----
> From: gluster-users-bounces at gluster.org
> [mailto:gluster-users-bounces at gluster.org] On Behalf Of Burnash, James
> Sent: Wednesday, May 12, 2010 10:30 AM
> To: gluster-users at gluster.org
> Subject: [Gluster-users] Best Practices for Gluster Replication
>
> Greetings List,
>
> I've searched through the Gluster wiki and a lot of threads to try to
> answer this question, but so far no real luck.
>
> Simply put - is it better to have replication handled by the clients,
> or by the bricks themselves?
>
> Volgen for a RAID 1 solution creates a config file that does the
> mirroring on the client side - which I would take as an implicit
> endorsement from the Gluster team (great team, BTW). However, it seems
> to me that if the bricks replicated between themselves on our 10Gb
> storage network, it could save a lot of bandwidth for the clients and
> conceivably save them CPU cycles and I/O as well.
>
> Client machines have 1Gb connections to the storage network, and are
> running CentOS 5.2.
> Server machines have 10Gb connections to the storage network, and are
> running CentOS 5.4.
> Glusterfs.vol:
>
> ## file auto generated by /usr/bin/glusterfs-volgen (mount.vol)
> # Cmd line:
> # $ /usr/bin/glusterfs-volgen --name testfs --raid 1
> #     jc1letgfs13-pfs1:/export/read-write jc1letgfs14-pfs1:/export/read-write
> #     jc1letgfs15-pfs1:/export/read-write jc1letgfs16-pfs1:/export/read-write
> #     jc1letgfs17-pfs1:/export/read-write jc1letgfs18-pfs1:/export/read-write
>
> # RAID 1
> # TRANSPORT-TYPE tcp
> volume jc1letgfs17-pfs1-1
>   type protocol/client
>   option transport-type tcp
>   option remote-host jc1letgfs17-pfs1
>   option transport.socket.nodelay on
>   option transport.remote-port 6996
>   option remote-subvolume brick1
> end-volume
>
> volume jc1letgfs18-pfs1-1
>   type protocol/client
>   option transport-type tcp
>   option remote-host jc1letgfs18-pfs1
>   option transport.socket.nodelay on
>   option transport.remote-port 6996
>   option remote-subvolume brick1
> end-volume
>
> volume jc1letgfs13-pfs1-1
>   type protocol/client
>   option transport-type tcp
>   option remote-host jc1letgfs13-pfs1
>   option transport.socket.nodelay on
>   option transport.remote-port 6996
>   option remote-subvolume brick1
> end-volume
>
> volume jc1letgfs15-pfs1-1
>   type protocol/client
>   option transport-type tcp
>   option remote-host jc1letgfs15-pfs1
>   option transport.socket.nodelay on
>   option transport.remote-port 6996
>   option remote-subvolume brick1
> end-volume
>
> volume jc1letgfs16-pfs1-1
>   type protocol/client
>   option transport-type tcp
>   option remote-host jc1letgfs16-pfs1
>   option transport.socket.nodelay on
>   option transport.remote-port 6996
>   option remote-subvolume brick1
> end-volume
>
> volume jc1letgfs14-pfs1-1
>   type protocol/client
>   option transport-type tcp
>   option remote-host jc1letgfs14-pfs1
>   option transport.socket.nodelay on
>   option transport.remote-port 6996
>   option remote-subvolume brick1
> end-volume
>
> volume mirror-0
>   type cluster/replicate
>   subvolumes jc1letgfs13-pfs1-1 jc1letgfs14-pfs1-1
> end-volume
>
> volume mirror-1
>   type cluster/replicate
>   subvolumes jc1letgfs15-pfs1-1 jc1letgfs16-pfs1-1
> end-volume
>
> volume mirror-2
>   type cluster/replicate
>   subvolumes jc1letgfs17-pfs1-1 jc1letgfs18-pfs1-1
> end-volume
>
> volume distribute
>   type cluster/distribute
>   subvolumes mirror-0 mirror-1 mirror-2
> end-volume
>
> volume readahead
>   type performance/read-ahead
>   option page-count 4
>   subvolumes distribute
> end-volume
>
> volume iocache
>   type performance/io-cache
>   option cache-size `echo $(( $(grep 'MemTotal' /proc/meminfo | sed 's/[^0-9]//g') / 5120 ))`MB
>   option cache-timeout 1
>   subvolumes readahead
> end-volume
>
> volume quickread
>   type performance/quick-read
>   option cache-timeout 1
>   option max-file-size 64kB
>   subvolumes iocache
> end-volume
>
> volume writebehind
>   type performance/write-behind
>   option cache-size 4MB
>   subvolumes quickread
> end-volume
>
> volume statprefetch
>   type performance/stat-prefetch
>   subvolumes writebehind
> end-volume
>
> Glusterfsd.vol:
>
> ## file auto generated by /usr/bin/glusterfs-volgen (export.vol)
> # Cmd line:
> # $ /usr/bin/glusterfs-volgen --name testfs
> #     jc1letgfs13-pfs1:/export/read-write jc1letgfs14-pfs1:/export/read-write
> #     jc1letgfs15-pfs1:/export/read-write
>
> volume posix1
>   type storage/posix
>   option directory /export/read-write
> end-volume
>
> volume locks1
>   type features/locks
>   subvolumes posix1
> end-volume
>
> volume brick1
>   type performance/io-threads
>   option thread-count 8
>   subvolumes locks1
> end-volume
>
> volume server-tcp
>   type protocol/server
>   option transport-type tcp
>   option auth.addr.brick1.allow *
>   option transport.socket.listen-port 6996
>   option transport.socket.nodelay on
>   subvolumes brick1
> end-volume
>
> James Burnash
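For comparison with the volgen-generated, client-side-replicating volfiles quoted above, here is a rough, untested sketch of what the server-side alternative James asks about could look like in the volfile syntax of this era. It is hand-written rather than volgen output, it reuses the host, brick, and port names from James's configuration purely for illustration, and the peer server (jc1letgfs14-pfs1) would need the mirror-image file.

    # glusterfsd.vol on jc1letgfs13-pfs1 - illustrative sketch only, not volgen output
    volume posix1
      type storage/posix
      option directory /export/read-write
    end-volume

    volume locks1
      type features/locks
      subvolumes posix1
    end-volume

    # raw local brick, still exported so the peer server can reach it
    volume brick1
      type performance/io-threads
      option thread-count 8
      subvolumes locks1
    end-volume

    # connection to the peer server's raw brick over the 10Gb storage network
    volume peer-brick
      type protocol/client
      option transport-type tcp
      option remote-host jc1letgfs14-pfs1
      option transport.remote-port 6996
      option remote-subvolume brick1
    end-volume

    # replication now happens here, on the server, instead of on the client
    volume mirror-0
      type cluster/replicate
      subvolumes brick1 peer-brick
    end-volume

    volume server-tcp
      type protocol/server
      option transport-type tcp
      option transport.socket.listen-port 6996
      option auth.addr.brick1.allow *
      option auth.addr.mirror-0.allow *
      # export both: mirror-0 for the clients, brick1 for the peer server
      subvolumes mirror-0 brick1
    end-volume

Each client would then mount a single protocol/client pointing at mirror-0 on whichever server it happens to reach, which is exactly why this layout needs round-robin DNS or a similar front end for HA, as Chris describes above.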
Burnash, James
2010-May-13 13:43 UTC
[Gluster-users] Best Practices for Gluster Replication
Chris,

Excellent, and thanks - that was exactly what I was looking for.

James

-----Original Message-----
From: gluster-users-bounces at gluster.org
[mailto:gluster-users-bounces at gluster.org] On Behalf Of Christopher Hawkins
Sent: Thursday, May 13, 2010 8:23 AM
To: gluster-users
Subject: Re: [Gluster-users] Best Practices for Gluster Replication

> [Quoted text of Christopher's reply and of the original message, including the
> volume files, snipped - identical to the messages above.]
Marcus et al. -

Good discussion all around. A couple of points to clear up some of the terminology, and answers to a couple of architecture questions that haven't been addressed yet.

1. The Gluster File System client is designed to be installed on the devices that are consuming the storage. By installing the client there you get:

   1a. Mirror on write - simultaneous writes to any number of mirrors.
   1b. Storage server failure that is transparent to your application.
   1c. Significant performance benefits.

2. In the majority of installations the user runs the Gluster File System client wherever possible, but often also needs to access the Gluster cluster via NFS, CIFS, or some other NAS-style protocol. Gluster is designed to support those needs. There are some concepts that are important to understanding Gluster's behavior when the Gluster client isn't being used:

   2a. Any file can be accessed from any node at any time. The physical location of the file is irrelevant.
   2b. The entire distributed filesystem can be accessed by all protocols at the same time.
   2c. Only the Gluster client can communicate with the Gluster server daemon.
   2d. Only the Gluster client can mirror or replicate.
   2e. The Gluster client can be installed on a Gluster server.
   2f. Fundamental to NFS, CIFS, etc. is the idea that their clients access a single IP address for storage. (The Gluster client is a solution to this problem!) If the remote storage server they have mounted fails, they have no way to access the storage.
   2g. The user is expected to provide some method of ensuring that when clients access the Gluster cluster via NFS and the like, the number of connections to any one node is about the same as to every other node. The user is also expected to provide a method of ensuring that if a storage server fails, the NFS, CIFS, etc. client has the opportunity to connect to another storage server. Customers usually use RRDNS, UCARP, HAProxy, or enterprise load-balancing hardware (F5, ACE, etc.) for this IP failover / balancing layer.

That sounds more complicated than it is. We install the Gluster client on the server, mount the distributed filesystem just as we would on any other host, and then re-export that mount as NFS, CIFS, etc. We install that stack on every storage node. A user-supplied layer on top of that balances inbound connections among the nodes.

I've got a new pretty picture that tries to simplify some of this. It is a really rough draft; your feedback is appreciated. We (Gluster Inc) are working hard to find better ways to describe the big-picture Gluster architecture to you, our users. Any ideas, language, concepts, pictures, questions you can't find the answers to (42!), anything at all you think might help - please send it my way!

--
Craig Carl
Gluster, Inc.
Cell - (408) 829-9953 (California, USA)
Gtalk - craig.carl at gmail.com
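As a concrete illustration of the stack Craig describes - the Gluster FUSE client mounted locally on every storage node, that mount re-exported over NFS, and round-robin DNS spreading the NAS clients across the nodes - a minimal sketch follows. The hostname, IP addresses, and mount point are invented for the example, the volfile path is assumed to match the volgen output earlier in the thread, and kernel NFS re-export of a FUSE mount needs an explicit fsid (some deployments of this era used unfs3 for the re-export instead).

    # On each storage node, mount the volume with the Gluster FUSE client,
    # using the same client volfile the ordinary clients use (path assumed):
    mount -t glusterfs /etc/glusterfs/glusterfs.vol /mnt/testfs

    # /etc/exports on each storage node - re-export the FUSE mount over NFS.
    # An explicit fsid is required when exporting FUSE; the subnet is an example.
    /mnt/testfs  10.0.0.0/24(rw,no_subtree_check,fsid=10)

    # Round-robin DNS zone fragment (invented name and addresses): the NAS
    # clients all mount gfs-nas:/mnt/testfs and land on different nodes.
    gfs-nas    IN A 10.0.0.13
    gfs-nas    IN A 10.0.0.14
    gfs-nas    IN A 10.0.0.15
    gfs-nas    IN A 10.0.0.16
    gfs-nas    IN A 10.0.0.17
    gfs-nas    IN A 10.0.0.18

Round-robin DNS only spreads the initial mounts around; it does not move an already-mounted NFS client off a dead node, which is where UCARP, heartbeat-managed addresses, or a load balancer come into play.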
----- Original Message -----
From: "Marcus Bointon" <marcus at synchromedia.co.uk>
To: "gluster-users at gluster.org Users" <gluster-users at gluster.org>
Sent: Thursday, May 13, 2010 7:43:26 AM GMT -08:00 US/Canada Pacific
Subject: Re: [Gluster-users] Best Practices for Gluster Replication

On 13 May 2010, at 16:28, Burnash, James wrote:

> I'm also not sure how I would go about setting this up with 2 NFS servers - would this
> be some sort of load balancing solution (using round robin DNS or an actual load
> balancer), or would this be implemented by having each NFS server responsible for only
> exporting a given portion of the whole Glusterfs backend storage.

I'm not really sure of the best way to do it - NFS isn't really my thing. I assume that there are load balancing / failover solutions (haproxy, pound, heartbeat, etc.) that can deal with NFS - it would help if the balancer understood NFS at some kind of transactional level (as they can for HTTP). I would export each of the different gluster portions you want as separate NFS share points.

Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK resellers of info at hand CRM solutions
marcus at synchromedia.co.uk | http://www.synchromedia.co.uk/
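For the balancer route Marcus mentions, HAProxy can only treat NFS as opaque TCP - it has no NFS-level (transactional) awareness - so a fragment along the following lines is really only clean for NFSv4, where everything rides on port 2049; NFSv3 also needs portmapper, mountd, and the lock/status daemons handled somehow. The virtual IP, backend addresses, and server names are invented for illustration.

    # /etc/haproxy/haproxy.cfg (fragment) - plain TCP pass-through, no NFS awareness
    defaults
        mode tcp
        timeout connect 5s
        timeout client  1h
        timeout server  1h

    # single virtual IP the NAS clients mount (address invented)
    frontend nfs_in
        bind 10.0.0.100:2049
        default_backend gluster_nfs

    # one entry per storage node re-exporting the Gluster mount
    backend gluster_nfs
        balance leastconn
        server jc1letgfs13 10.0.0.13:2049 check
        server jc1letgfs14 10.0.0.14:2049 check
        server jc1letgfs15 10.0.0.15:2049 check
        server jc1letgfs16 10.0.0.16:2049 check
        server jc1letgfs17 10.0.0.17:2049 check
        server jc1letgfs18 10.0.0.18:2049 check

The proxy itself then becomes a single point of failure, so in practice its address is usually kept available with UCARP or heartbeat on a pair of balancer hosts.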
Trying the attachment again, email is so complicated!

Craig

----- Original Message -----
From: "Craig Carl" <craig at gluster.com>
To: "Marcus Bointon" <marcus at synchromedia.co.uk>
Cc: "gluster-users at gluster.org Users" <gluster-users at gluster.org>
Sent: Thursday, May 13, 2010 10:01:11 PM GMT -08:00 US/Canada Pacific
Subject: Re: [Gluster-users] Best Practices for Gluster Replication

> [Quoted text of Craig's previous message and of Marcus's reply snipped -
> identical to the message above.]