Hi all,

I am a grad student setting up a new cluster in our research group. We already have five nodes, each with a 5 x 1 TB RAID-5 array. Currently we just export these disks over NFS (/cluster/data0[1-5]). This is already somewhat bothersome, because one needs to remember which of the five NFS mounts contains a dataset of interest.

Now we are getting four new nodes with faster disks (each with a 12 x 600 GB array of 15K RPM drives), and we would like to merge these (at least) into a global filesystem, possibly adding the existing disks as well.

GlusterFS looks very promising, especially because it doesn't need to take over the filesystem, and the configuration looks relatively simple (compared to GPFS or Lustre). However, I am having trouble tracking down a detailed explanation of how it works, so that I can see where the weak points are. The installation guide on the wiki was a good starting point for a very basic understanding, but I haven't found a detailed explanation of the configuration options &c. Does such a manual exist?

Also, how robust is GlusterFS? We would probably want to stripe the data to improve performance, but if a server dies, does the file catalogue go with it, resulting in total data loss? Or is the metadata replicated somehow, so that one can recover the partial files?

Any help, including pointers to existing configurations that I can learn from, would be appreciated.

kind regards,
Doug Schouten

p.s. To describe our needs more fully: our datasets consist of many files on the order of 100 - 200 MB in size. Typically we write files once (retrieved from a central collaboration server) and read them many times as we tune an analysis, so read speed is much more important than write performance. Redundancy is not a huge concern, since most of our data is replicated at remote sites anyway ... although stability is still a consideration, because re-fetching the data takes O(days). The machines are connected by dual bonded 1 Gb Ethernet.
Latency is probably not an issue since they are all connected on an internal switch in the same rack.
On 02/09/2011 08:36 PM, Doug Schouten wrote:
> Also, how robust is GlusterFS? We probably want to stripe the data to
> improve performance, but if a server dies, does the file catalogue go
> with it, resulting in total data loss? Or does the meta-data get
> replicated somehow so that one can recover the partial files?

There is no central catalog or metadata server. All GlusterFS data and metadata is pretty directly reflected in data and metadata on the servers' local filesystems, so *in general* you could take GlusterFS entirely out of the picture and trivially reconstruct a unified view just by copying all of those local filesystems into one place. There are two notable exceptions, though:

* If you use DHT/distribute, each file will exist as a complete local file on one "brick", but there might also be "linkfiles" (zero length, sticky bit set, distinctive xattrs) in the same place on other bricks. If you were to attempt "all into one" recovery as described above, you'd have to exclude the linkfiles or else they might overwrite (truncate) the real files.

* If you use N-way striping, each file will exist as N files on N bricks. Each of these files will be non-zero-length but will contain only the data for 1/N of the blocks of the file; the rest will be "holes" that read as zero but are actually unallocated space. There *is* information attached to each file (as xattrs) that identifies which stripe component it is. Recovery in this case would require reading that information and using "dd" or similar to reassemble the N files back into one, so it's a little more tedious than the non-striping case, but not prohibitively difficult.

As you can see, recovery is pretty simple, but it's also important to keep in mind what happens between the time a server dies and the time you recover. If you're using simple DHT and relying on RAID (instead of GlusterFS replication) for data protection, you're still vulnerable to failure of non-storage components on a server.
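To make the recovery steps above concrete, here is a rough sketch in shell. These are hypothetical helpers, not official GlusterFS tools; all paths are placeholders, and the stripe example assumes a known component order, a 2-way stripe, and a fixed stripe block size. In a real recovery you would read the xattrs on each component (e.g. with getfattr) to determine the actual order and block size.

```shell
#!/bin/sh
# Sketch only -- hypothetical recovery helpers, not part of GlusterFS.

# Merge DHT brick contents into one tree, skipping DHT "linkfiles"
# (zero-length files with the sticky bit set).
#   merge_bricks DEST BRICK [BRICK...]
merge_bricks() {
  dest=$1; shift
  mkdir -p "$dest"
  for brick in "$@"; do
    # copy only real files; linkfiles match "-size 0 -perm -1000"
    ( cd "$brick" &&
      find . -type f ! \( -size 0 -perm -1000 \) \
        -exec cp --parents -p {} "$dest" \; )
  done
}

# Reassemble a 2-way-striped file from its two components by copying
# alternating stripe blocks; the regions a component does not own are
# holes, so only its own blocks carry data.
#   unstripe2 PART0 PART1 OUTPUT [BLOCKSIZE]
unstripe2() {
  p0=$1; p1=$2; out=$3; bs=${4:-131072}
  # logical size: take the size of the larger component
  s0=$(stat -c %s "$p0"); s1=$(stat -c %s "$p1")
  size=$s0
  if [ "$s1" -gt "$size" ]; then size=$s1; fi
  nblocks=$(( (size + bs - 1) / bs ))
  : > "$out"
  n=0
  while [ "$n" -lt "$nblocks" ]; do
    # even-numbered blocks live in part0, odd-numbered in part1
    if [ $((n % 2)) -eq 0 ]; then src=$p0; else src=$p1; fi
    dd if="$src" of="$out" bs="$bs" count=1 skip="$n" seek="$n" \
       conv=notrunc status=none
    n=$((n + 1))
  done
  truncate -s "$size" "$out"
}
```

Usage would be along the lines of `merge_bricks /recovered /mnt/brick1 /mnt/brick2` after mounting the surviving bricks' local filesystems somewhere.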
If such a failure were to happen, then 1/N of your files - for all practical purposes at random - would become inaccessible. I've also seen problems with creating new files that would be assigned to the resulting "gap" in the hash space DHT uses to distribute data, though most of these seem to have been fixed in 3.1 or later. It's very disconcerting when it happens. My recommendation would be to plan for migration to a scheme where each server exposes two smaller "bricks" (which might still use RAID internally), with GlusterFS replication between bricks on different servers, to protect fully against this kind of failure.
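For the layout suggested above, volume creation with the 3.1+ CLI might look something like the following. Server and path names are placeholders; the point is the brick ordering, since with "replica 2" consecutive bricks in the list form a replica pair, so interleaving the servers puts each pair on different machines:

```shell
# Hypothetical layout: two servers, two bricks each.  With "replica 2",
# consecutive bricks are paired, so list the servers alternately:
gluster volume create myvol replica 2 transport tcp \
    server1:/export/brick1 server2:/export/brick1 \
    server1:/export/brick2 server2:/export/brick2
gluster volume start myvol
```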