All -
A much more detailed update from engineering is below. The range of
affected versions is bigger than I thought; I apologize for not getting
that right earlier. Please let me know if you have any questions.
VERSIONS AFFECTED: All current GlusterFS releases up to and including
3.1.5 and 3.2.1
SEVERITY: For Enomaly users, a conflict results in denial of service but
no loss of data. We have not observed the same race conditions in
non-Enomaly environments.
*CAUSE: Gluster and Enomaly incompatibility*
There is an incompatibility between how GlusterFS and Enomaly perform
directory operations. Enomaly's agents monitor node failures and
migrate VMs automatically. These distributed agents communicate with
each other through directory operations such as mkdir. These directory
operations trigger a race condition in GlusterFS, locking up a storage
node. The Enomaly agents then get confused and propagate the error
across the site, even for a single node failure.
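To illustrate the kind of directory-based signalling described above,
here is a minimal Python sketch that uses mkdir as a mutual-exclusion
primitive on a shared mount. This is not Enomaly's code: the function
names and the lock path are hypothetical, and a local temp directory
stands in for the GlusterFS mount. It only shows why agents like these
depend on mkdir behaving atomically across nodes.

```python
import os
import tempfile


def try_acquire(lock_dir):
    """Try to take the lock by creating lock_dir; return True on success.

    mkdir either creates the directory or fails because it already
    exists, so at most one caller should succeed on a well-behaved
    filesystem.
    """
    try:
        os.mkdir(lock_dir)
        return True
    except FileExistsError:
        return False


def release(lock_dir):
    """Release the lock by removing the (empty) directory."""
    os.rmdir(lock_dir)


if __name__ == "__main__":
    # Assumption: a local temp directory stands in for the shared mount.
    shared = tempfile.mkdtemp()
    lock = os.path.join(shared, "vm-0001.lock")  # hypothetical lock name
    print(try_acquire(lock))  # first agent takes the lock
    print(try_acquire(lock))  # second agent is refused
    release(lock)
    print(try_acquire(lock))  # lock is available again after release
```

If the filesystem answers two concurrent mkdir calls inconsistently, or
one call hangs, every agent relying on this pattern is affected, which
fits the site-wide failure described above.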
A race condition related to changing GFIDs was fixed in
3.1.5 - http://blog.gluster.com/2011/06/glusterfs-3-1-5-now-available/ -
but this is only a partial fix for the behavior described. Race
conditions can still occur.
After fixing the initial outage, if any node fails, you will see the
issue again. Upgrading GlusterFS to 3.1.5 and restarting GlusterFS and
Enomaly is a temporary fix. A permanent solution requires the Gluster
3.1.6 or 3.2.2 release (coming soon, see "Solution" below).
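For anyone auditing several servers, a quick way to apply the version
ranges above is a small check like the following Python sketch. The
function names are hypothetical, and it assumes plain numeric version
strings such as "3.1.5" (no suffixes). It also assumes that anything
older than 3.1.6 is affected, which is my reading of "all current
releases" rather than a tested statement about pre-3.1 versions.

```python
def parse_version(text):
    """Turn a dotted version string like '3.1.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version):
    """Return True if the version predates the permanent fix.

    Per this advisory the permanent fix is in 3.1.6 (3.1.x line) and
    3.2.2 (3.2.x line). Assumption: releases older than 3.1 are treated
    as affected too.
    """
    v = parse_version(version)
    if v[:2] == (3, 2):
        return v < (3, 2, 2)
    return v < (3, 1, 6)


if __name__ == "__main__":
    for candidate in ["3.1.4", "3.1.5", "3.1.6", "3.2.1", "3.2.2"]:
        status = "affected" if is_affected(candidate) else "fixed"
        print(candidate, status)
```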
Other possible race conditions are fixed in the current source tree,
subject to further testing.
*SOLUTION:*
This issue has been fixed in our source repository
(https://github.com/gluster) and will be released soon with 3.1.6 and
3.2.2. If you'd like to help test the current fixes, please contact us
before you do anything foolish (read: use in production). Users who test
these patches in their non-critical, development environments and send
us feedback will each get a Gluster t-shirt, maybe even a hat!!
We will send out another alert as soon as both releases are GA.
--
Thanks,
Craig Carl
Senior Systems Engineer | Gluster
408-829-9953 | San Francisco, CA
http://www.gluster.com/gluster-for-aws/
> ------------------------------------------------------------------------
>
> Craig Carl <mailto:craig at gluster.com>
> June 24, 2011 1:57 PM
>
>
> All -
>
> Gluster has identified a serious issue that affects anyone hosting VM
> images for the Enomaly Elastic Cloud Platform with Gluster. This issue
> is limited to Gluster versions <= 3.1.4. We strongly encourage anyone
> using Enomaly and Gluster to upgrade to Gluster version 3.1.5 or higher.
>
> What causes the failure -
>
> Use a distribute-replicate volume.
> Mount using either NFS or the GlusterFS native client.
> Fail one of the replica nodes.
> ** Production is unaffected at this point.
> Restart the failed node.
> ** All the virtual machines fail.
> ** The ecpagent service on each hypervisor will constantly restart.
>
> Root cause -
>
> Enomaly uses a locking mechanism in addition to and above the standard
> POSIX locking to make sure that a VM never starts on two servers at
> the same time. When an Enomaly server starts a VM, it writes a file
> (<randomstuff.tmp>) to the directory. The combination of self-heal and
> a race between the ecpagents on the hypervisors results in the VMs
> failing to start. No data is lost or damaged.
>
> Again, this issue is specific to Enomaly. Enomaly users should
> immediately upgrade to a version of Gluster >= 3.1.5.
>
> http://download.gluster.com/pub/gluster/glusterfs/
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users