> Gluster (3.8.7) coped perfectly - no data loss, no maintenance required,
> each time it came up by itself with no hand holding and started healing
> nodes, which completed very quickly. VM's on gluster auto started with
> no problems, i/o load while healing was ok. I felt quite confident in it.
Glad to hear that part went well.
> The alternate cluster fs - not so good. Many times running VM's were
> corrupted, several times I lost the entire filesystem. Also IOPS were
> atrocious (fuse based). It's easy to claim HA when you exclude such things
> as power supply failures, dodgy network switches etc.
Too true. Unfortunately, I think just about every distributed storage
system has to go through this learning curve, from not handling failure
at all to handling the simplest/easiest cases to handling the weird stuff
that real deployments can throw at you. It's not just about the actual
failure handling, either. Sometimes, it's about things you do in the
main I/O path, such as not throwing away O_SYNC flags to claim better
performance. From the information you've provided, I'll bet that's
where your data corruption came from.
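
To make the O_SYNC point concrete, here is a minimal Python sketch. It is
my own illustration, not code from either filesystem; the helper names
write_durable and strip_sync are hypothetical. The first function honors
the flag, the second shows the shortcut that trades safety for benchmark
numbers.

```python
import os

def write_durable(path, data):
    """Write data with O_SYNC, so each write() returns only after the
    data has been handed to stable storage (as far as the OS can tell)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)

def strip_sync(flags):
    """The anti-pattern: silently drop O_SYNC from the caller's flags.
    Writes get faster, but a crash can lose data the caller was told
    was safe, which is exactly what corrupts a running VM image."""
    return flags & ~os.O_SYNC
```

A storage layer that does the equivalent of strip_sync anywhere in its
I/O path will look great until the first power failure.
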
> I think Gluster's active/active quorum based design, where every node
> is a master, is a winner. Active/passive systems where you have a SPOF
> master are difficult to DR manage.
Active/passive designs create a very tough set of tradeoffs. Detecting
and responding to failures quickly enough, while also avoiding false
alarms, is like balancing on a knife edge. Then there are problems with
overload turning into failure, with failback, etc. It can all be done
right and work well, but it's *really* hard. While I guess active/passive
is better than nothing, experience has shown that active/active designs
are easier to make robust, and the techniques for doing so have been well
known for at least a decade or so.
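
As a rough illustration of that knife edge, here is a toy heartbeat-based
failure detector in Python. It is a sketch, not how any particular system
does it; the function name and all of the numbers are made up.

```python
def is_suspected(last_heartbeat, now, timeout):
    """Suspect the master once no heartbeat has arrived for `timeout` seconds."""
    return (now - last_heartbeat) > timeout

# Hypothetical scenario: an overloaded but healthy master goes quiet for 3 s.
stall = 3.0
aggressive_timeout = 2.0     # quick detection, but trips on the stall
conservative_timeout = 10.0  # tolerates the stall, but reacts slowly

# Short timeout: the healthy master is declared dead (a false alarm),
# which is how overload turns into failure.
false_alarm = is_suspected(0.0, stall, aggressive_timeout)

# Long timeout: the stall is tolerated, but a real crash would go
# unnoticed for up to 10 s before failover could even begin.
tolerated = not is_suspected(0.0, stall, conservative_timeout)
```

No single timeout value fixes both problems, which is why this is so
hard to tune in an active/passive design.
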
> However :) Things I'd really like to see in Gluster:
>
> - More flexible/easier management of servers and bricks
>   (add/remove/replace)
>
> - More flexible replication rules
>
> One of the things I really *really* like with LizardFS is the powerful
> goal system and chunkservers. Nodes and disks can be trivially easily
> added/removed on the fly and chunks will be shuffled, replicated or
> deleted to balance the system. Individual objects can have different
> goals (replication levels) which can also be changed on the fly and the
> system will rebalance them. Objects can even be changed from/to simple
> replication to Erasure Encoded objects.
>
> I doubt this could be fitted to the existing gluster, but is there
> potential for this sort of thing in Gluster 4.0? I read the design docs
> and they look ambitious.
There used to be an idea called "data classification" to cover this
kind of case. You're right that setting arbitrary goals for arbitrary
objects would be too difficult. However, we could have multiple pools
with different replication/EC strategies, then use a translator like
the one for tiering to control which objects go into which pools based
on some kind of policy. To support that with a relatively small
number of nodes/bricks we'd also need to be able to split bricks into
smaller units, but that's not really all that hard.
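
To sketch what policy-based placement across pools might look like, here
is a small Python example. Everything in it is hypothetical: the Pool
class, the pool names, the strategy strings and the glob patterns are
invented for illustration and are not an existing Gluster interface.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

@dataclass(frozen=True)
class Pool:
    name: str
    strategy: str  # illustrative labels, e.g. "replica-3" or "ec-4+2"

# Ordered policy: the first matching pattern wins (all values hypothetical).
RULES = [
    ("*.qcow2", Pool("vm-images", "replica-3")),
    ("archive/*", Pool("cold", "ec-4+2")),
]
DEFAULT_POOL = Pool("default", "replica-2")

def choose_pool(path, rules=RULES, default=DEFAULT_POOL):
    """Return the pool whose pattern first matches the object's path."""
    for pattern, pool in rules:
        if fnmatchcase(path, pattern):
            return pool
    return default
```

The point is that the policy only selects among a few pre-built pools,
rather than tracking an arbitrary goal for every object.
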
Unfortunately, although many of these ideas have been around for at
least a year and a half, nobody has ever been freed up to work on
them. Maybe, with all of the interest in multi-tenancy to support
containers and hyperconvergence and whatever else, we might finally
be able to get these under way.