Rudi Ahlers
2010-Oct-21 08:09 UTC
[Gluster-users] I'm new to Gluster, and have some questions
Hi all,

I'm considering setting up Gluster, and have a few questions if you don't mind.

1. Which option is better? I already have a few CentOS 5.5 servers set up. Would it be better to just install GlusterFS, or to install Gluster Storage Platform from scratch? How / where can I see a full comparison between the two? Are there any performance / management benefits in choosing one over the other?

2. I need reliability and speed. From what I understand, I could set up 2 servers to work similar to software RAID1 (mirroring). Is it also correct to assume that I could use 4 servers in a RAID10 / 1+0 type setup? But then obviously serverA & serverB will be mirrored, and serverC & serverD together? What happens to the data? Does it get filled randomly between the 2 sets of servers, or does it get put onto serverA & B first, till it's full, then move over to C & D?

3. Has anyone noticed any considerable difference between 1x 1GbE NIC and 2x 1GbE NICs bonded together? Or should I rather use a quad-port NIC if / where possible?

4. How do clients (i.e. users) connect if I want to give them normal FTP / SMB / NFS access? Or do I need to mount the exported Gluster volume on another Linux server first which already runs these services?

5. If there are 10 Gluster servers, for example, with a lot of data spread out across them, how do the clients connect, exactly? Do they all connect to a central server which then just "fetches and delivers" the content to the clients, or do the clients connect directly to the specific server where their content is? I.e. is the network traffic split evenly across the servers, according to where the data is stored?

tia :)

--
Kind Regards
Rudi Ahlers
SoftDux
Website: http://www.SoftDux.com
Technical Blog: http://Blog.SoftDux.com
Office: 087 805 9573
Cell: 082 554 7532
Daniel Mons
2010-Oct-22 00:03 UTC
[Gluster-users] I'm new to Gluster, and have some questions
On Thu, Oct 21, 2010 at 6:09 PM, Rudi Ahlers <Rudi at softdux.com> wrote:

> 1. Which option is better? I already have a few CentOS 5.5 servers
> set up. Would it be better to just install GlusterFS, or to install
> Gluster Storage Platform from scratch? How / where can I see a full
> comparison between the two? Are there any performance / management
> benefits in choosing one over the other?

Gluster Storage Platform is near zero effort to set up. Literally boot from the provided USB stick image, and follow your nose. From there, all setup is via a GUI, and it's easy for novices to see what's going on. The downside is that with all of that GUI management you lose a lot of low-level control (and, IMHO, understanding of what's going on). So the trade-off is whether you want a graphical management tool where a lot of the "black magic" is hidden, or whether you want to roll up your sleeves and control the system yourself.

As a long-time Linux sysadmin, I prefer the GlusterFS option on a Linux distro of my choice. Pretty GUIs are nice for Windows and VMware users who generally fear keyboards, but give me a CLI (and SSH access!) any day. Personal preference; caveat emptor.

> 2. I need reliability and speed. From what I understand, I could set up
> 2 servers to work similar to software RAID1 (mirroring). Is it also
> correct to assume that I could use 4 servers in a RAID10 / 1+0 type
> setup? But then obviously serverA & serverB will be mirrored, and
> serverC & serverD together? What happens to the data? Does it get
> filled randomly between the 2 sets of servers, or does it get put onto
> serverA & B first, till it's full, then move over to C & D?

There's no right and wrong here. You can set up individual disks as bricks if you like (multiple bricks per server), or you can LVM/JBOD them up and present one big brick per node, or you can use RAID per node. The performance and reliability trade-offs of each really depend on your own needs.
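For the four-server "RAID10-style" layout asked about, here is a minimal sketch using GlusterFS 3.1's CLI (server names and brick paths are hypothetical, and these commands need a running glusterd on each node):

```shell
# On serverA: add the other three nodes to the trusted pool
gluster peer probe serverB
gluster peer probe serverC
gluster peer probe serverD

# Create a distributed-replicated ("RAID10-style") volume.
# With "replica 2", bricks pair up in the order listed, so
# serverA/serverB mirror each other and serverC/serverD mirror
# each other; files are then distributed across the two pairs.
gluster volume create datavol replica 2 transport tcp \
    serverA:/export/brick serverB:/export/brick \
    serverC:/export/brick serverD:/export/brick
gluster volume start datavol
```

To answer the "where does the data go" part: files are placed on one mirrored pair or the other based on a hash of the filename, so the two pairs fill roughly evenly rather than A/B filling up first and then spilling over to C/D.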
Speaking for myself, I currently use RAID5 for nodes with 6 or fewer disks, and RAID6 for 7 or more, presenting a single logical storage brick per node regardless of how many physical disks are in each. My GlusterFS setups are replicate+stripe across the whole cluster, so there are multiple levels of redundancy (within the node, and within the cluster), which lets me sleep easier at night.

As for how the data gets shuffled about inside GlusterFS, that depends on how you've set it up. For distributed data, there are various thresholds you can set to make sure that once a limit is hit, data will be written to other servers by preference. With replicate and stripe that isn't so much of an option, as data will roughly fill all nodes evenly. A lot of the technical detail (including how Gluster chooses nodes) is covered in the doco:

http://www.gluster.com/community/documentation/index.php/Translators/cluster

> 3. Has anyone noticed any considerable difference between 1x 1GbE NIC
> and 2x 1GbE NICs bonded together? Or should I rather use a quad-port NIC
> if / where possible?

Simply put, the more NICs the better. If you start to get a lot of clients hitting the storage, you really want a lot of bandwidth to serve it. Plus 2 NICs per box give you redundancy as well, which is an added plus. A quad-port NIC per node could get costly once you add up switch ports and the like; depending on your vendor of choice, the jump to 10GbE may be worth it.

It's also worth remembering that your disks need to be able to feed the network. With 8 commodity 1TB SATA 7200RPM disks and Linux software RAID6, I get about 500MB/s serial reads on a single node (verified by both "dd" and "bonnie++"). That's enough to saturate 2x 1GbE cards, but 10GbE would probably be a waste. If I had faster storage (SAS/FC 10K or 15K RPM drives, or even SAN-backed storage), then 10GbE or even InfiniBand would start to come into consideration.
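The dd part of that check can be sketched like so (file path and size are arbitrary; for a meaningful number the file should be larger than RAM, and the page cache dropped before the read):

```shell
# Write a test file, forcing data to disk so the write rate is honest;
# dd reports throughput on stderr when it finishes
dd if=/dev/zero of=/tmp/ddtest.bin bs=1M count=256 conv=fdatasync

# Drop the page cache first (needs root) so the read actually hits the
# disks, then time a serial read of the same file:
# sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/tmp/ddtest.bin of=/dev/null bs=1M

# Clean up
rm -f /tmp/ddtest.bin
```

bonnie++ gives a more thorough picture (seeks, per-char I/O, file creation), but a quick dd run like the above is usually enough to tell whether the disks or the network will be the bottleneck.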
> 4. How do clients (i.e. users) connect if I want to give them normal
> FTP / SMB / NFS access? Or do I need to mount the exported Gluster volume
> on another Linux server first which already runs these services?

Yes, you need other services in front of GlusterFS. These don't necessarily need to be on separate machines - there's nothing stopping you running Samba/NFS/whatever on one of the Gluster nodes, with a locally mounted GlusterFS re-exported from there. Obviously that means something else could potentially eat into the performance of that node, which is something to consider on large sites.

Remember too that you can spread your services. If you have 4 GlusterFS nodes, you could put Samba/NFS/whatever on all 4, and via some scripting (or even network/DNS/VLAN) magic ensure that all of the users/machines in your org are spread somewhat evenly across all four nodes. That also means that if a single Samba/NFS/whatever server/service dies, only part of your network goes down, and affected users/machines can be migrated to other systems quickly. There are still advantages to having GlusterFS-backed storage even with "legacy" file sharing protocols in place over the top.

> 5. If there are 10 Gluster servers, for example, with a lot of data
> spread out across them, how do the clients connect, exactly? Do
> they all connect to a central server which then just "fetches and
> delivers" the content to the clients, or do the clients connect
> directly to the specific server where their content is? I.e. is the
> network traffic split evenly across the servers, according to where
> the data is stored?

Some explanation here: http://www.youtube.com/watch?v=EbJFWBkQpZ8

The client side of Gluster does a lot of work to decentralise the system.
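Going back to question 4 for a second: the local-mount-and-re-export setup might look roughly like this (the volume name, mount point and share name are hypothetical, and the service name is the CentOS one):

```shell
# On the node that will run Samba: mount the Gluster volume locally
mkdir -p /mnt/datavol
mount -t glusterfs localhost:/datavol /mnt/datavol

# Then export that mount point from smb.conf, e.g.:
#   [shared]
#       path = /mnt/datavol
#       read only = no
#       browseable = yes

# ...and reload Samba so the share appears
service smb reload
```

The same mount point can be exported over FTP or anything else that serves files from a local path; the service never needs to know GlusterFS is underneath.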
There's no "master node" per se, and the Global Name Space allows the client to see all servers at once:

http://en.wikipedia.org/wiki/Global_Namespace

This is quite a bit different to clusters of old, but the advantage is that native Gluster clients will fetch data directly from the storage node that has it (or, if it's striped/replicated, reads will be split across nodes via various load-balancing algorithms defined by you - "least busy", "round robin" and others). There's no need for native Gluster clients to fetch data through a single master node, which alleviates bottlenecks. Obviously this isn't the case for people accessing data through Samba/NFS in front of Gluster, but as before there are things you can do to spread that load and network traffic as well.

The whole concept of Gluster is very clever, and makes a lot of sense. The huge advantage, of course, is that for every node you add, you're also adding bandwidth to the overall cluster. This is the polar opposite of traditional centralised storage systems (SANs, etc.), where adding storage blocks reduces the average bandwidth per client, making performance worse as you scale (don't say that to a SAN vendor, though, because they'll get very upset and red-faced - it's the dirty little secret of the SAN business). Particularly for sites that require consistent storage growth over time (and let's face it - who doesn't?), Gluster is a fantastic idea.

Let's just say that traditional SAN and NAS solutions are now at the bottom of my shopping list when it comes to storage rollouts for business technology infrastructure I'm in charge of designing.

-Dan
Horacio Sanson
2010-Oct-22 00:55 UTC
[Gluster-users] I'm new to Gluster, and have some questions
I am just starting to play with Gluster, but I think I can give you some answers from my experience.

On Thursday 21 October 2010 17:09:32 Rudi Ahlers wrote:

> Hi all,
>
> I'm considering setting up Gluster, and have a few questions if you don't
> mind.
>
> 1. Which option is better? I already have a few CentOS 5.5 servers
> set up. Would it be better to just install GlusterFS, or to install
> Gluster Storage Platform from scratch? How / where can I see a full
> comparison between the two? Are there any performance / management
> benefits in choosing one over the other?

The Gluster Storage Platform includes GlusterFS: the platform is a complete OS (Fedora-based Linux) + GlusterFS + web management in a single package that can be installed via USB in a few minutes. It is supposed to simplify installation, setup and management of GlusterFS clusters, but... I could not get it to work properly. I was unable to add new servers: every time I pressed the "add new server" button I got an error saying "Could not retrive installer ip address". And since the platform is relatively new, there is near-zero documentation and almost nothing in the way of issue reports about it. Also, adding servers/volumes via the command line was never reflected in the web-based GUI.

So I installed Ubuntu 10.10 and GlusterFS 3.1 from source, and handling servers/volumes etc. via the new command line is a breeze.

> 2. I need reliability and speed. From what I understand, I could set up
> 2 servers to work similar to software RAID1 (mirroring). Is it also
> correct to assume that I could use 4 servers in a RAID10 / 1+0 type
> setup? But then obviously serverA & serverB will be mirrored, and
> serverC & serverD together? What happens to the data? Does it get
> filled randomly between the 2 sets of servers, or does it get put onto
> serverA & B first, till it's full, then move over to C & D?

I only have two servers for testing. What you set up are volumes, and each volume can be configured depending on your needs.
This is what I understand so far:

Distributed volume: aggregates the storage of several directories ("bricks" in Gluster terms) across several computers. The benefit is that you can grow/shrink the volume as you please. The bad part is that this offers no reliability guarantee: each file lives whole on a single brick (chosen by hashing the filename), so losing a brick loses the files stored on it.

Replicated volume: requires a minimum of 2 bricks, on separate servers. All files are replicated among the bricks; the number of replicas is configured at volume creation. This adds failure resilience on top of what a distributed volume gives you.

Stripe volume: requires a minimum of 2 bricks, on separate servers. Each file is split into stripes, and the stripes are distributed among the bricks of the volume; the stripe count and size are configured at volume creation. Striping adds no extra reliability (losing a brick loses part of every striped file), but it can improve read performance for large files, as reads are spread across several machines.

> 3. Has anyone noticed any considerable difference between 1x 1GbE NIC
> and 2x 1GbE NICs bonded together? Or should I rather use a quad-port NIC
> if / where possible?
>
> 4. How do clients (i.e. users) connect if I want to give them normal
> FTP / SMB / NFS access? Or do I need to mount the exported Gluster volume
> on another Linux server first which already runs these services?

Gluster 3.1 has a native NFSv3 implementation, so you can mount any Gluster volume as a normal NFS mount. For SMB you need to configure Samba to share the volume, and you can access the files on any of the bricks via SCP or FTP if you have an SSH or FTP server configured. For Linux the recommended way is to use the glusterfs client to mount it as a Gluster file system.

> 5. If there are 10 Gluster servers, for example, with a lot of data
> spread out across them, how do the clients connect, exactly? I.e.
> do they all connect to a central server which then just "fetches and
> delivers" the content to the clients, or do the clients connect
> directly to the specific server where their content is? I.e. is the
> network traffic split evenly across the servers, according to where
> the data is stored?

This is also something I would like to know. When connecting clients I use the command:

    mount -t [nfs|glusterfs] <ip-address>:<volume-name> /mount/point

where ip-address is the IP of any of the servers that have the volume configured. It is not clear to me how the reliability part works here: if I disconnect the server with that IP address, I lose access to the files. True, the files are still accessible via other servers, but I need to manually point the mount at another server, which is not exactly high-availability.

--
regards,
Horacio Sanson
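On that last reliability question: with the native glusterfs client, the server named on the mount line is only contacted to fetch the volume file at mount time; after that the client connects to every brick directly, so an already-established glusterfs mount should survive that particular server going down. Only new mounts (and NFS mounts, which do stay tied to the one server they mounted from) are affected, and round-robin DNS is one common way to soften that. A sketch, with hypothetical host and volume names:

```shell
# Native client: server1 is only used to fetch the volume file at
# mount time; afterwards the client talks to all bricks directly.
mount -t glusterfs server1:/datavol /mnt/datavol

# NFS clients stay pinned to the server they mounted from, so a
# round-robin DNS name (e.g. "gluster.example.com" with an A record
# per node) spreads clients across nodes and lets new mounts succeed
# when one node is down. Gluster 3.1's built-in NFS server is v3 only.
mount -t nfs -o vers=3 gluster.example.com:/datavol /mnt/datavol
```

Failing over an NFS client whose server has died still means remounting against another node; only the native client gets transparent continuity for mounts that are already up.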