Pranith Kumar Karampuri
2016-Nov-14 10:50 UTC
[Gluster-users] 3.7.16 with sharding corrupts VMDK files when adding and removing bricks
On Sat, Nov 12, 2016 at 4:28 PM, Gandalf Corvotempesta <gandalf.corvotempesta at gmail.com> wrote:

> On 12 Nov 2016 10:21, "Kevin Lemonnier" <lemonnierk at ulrar.net> wrote:
> > We've had a lot of problems in the past, but at least for us 3.7.12 (and 3.7.15)
> > seems to be working pretty well as long as you don't add bricks. We started doing
> > multiple little clusters and abandoned the idea of one big cluster, had no
> > issues since :)
>
> Well, adding bricks could be useful... :)
>
> Having to create multiple clusters is not a solution and is much more expensive.
> And if you corrupt data from a single cluster you still have issues.
>
> I think it would be better to add fewer features and focus more on stability.

First of all, thanks to all the folks who contributed to this thread. We value your feedback.

In gluster-users and ovirt-community we saw people trying gluster and complaining about heal times and split-brains. So we had to fix bugs in quorum in 3-way replication; then we started working on features like sharding for better heal times and arbiter volumes for cost benefits.

To make gluster stable for VM images we had to add all these new features and then fix all the bugs Lindsay/Kevin reported. We just fixed a corruption issue that can happen with replace-brick which will be available in 3.9.0 and 3.8.6. The only 2 other known issues that can lead to corruptions are add-brick and the bug you filed, Gandalf. Krutika just 5 minutes back saw something that could possibly lead to the corruption for the add-brick bug. Is that really the root cause? We are not sure yet, we need more time. Without Lindsay/Kevin/David Gossage's support this workload would have been in much worse condition. These bugs are not easy to re-create, thus not easy to fix. At least that has been Krutika's experience.

My takeaway from this mail thread is: I think it is important to educate users about why we are adding new features. People are coming to the conclusion that only bug fixing corresponds to stabilization and not features. It is a wrong perception. Without the work that went into adding all those new features above in gluster, most probably you guys wouldn't have given gluster another chance, because it used to be unusable for VM workloads before these features.

One more takeaway is to get the documentation right. Lack of documentation led Alex to try the worst possible combo for storing VMs on gluster. So we as a community failed in some way there as well.

Krutika will be sending out VM use-case related documentation after the 28th of this month. If you have any other feedback, do let us know.

> In software-defined storage, stability and consistency are the most important things.
>
> I'm also subscribed to the moosefs and lizardfs mailing lists and I don't recall a single data corruption/data loss event.
>
> In gluster, after some days of testing I've found a huge data corruption issue that is still unfixed on bugzilla.
> If you change the shard size on a populated cluster, you break all existing data.
> Try to do this on a cluster with working VMs and see what happens....
> A single CLI command breaks everything and is still unfixed.

--
Pranith
Gandalf Corvotempesta
2016-Nov-14 11:08 UTC
[Gluster-users] 3.7.16 with sharding corrupts VMDK files when adding and removing bricks
2016-11-14 11:50 GMT+01:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:

> To make gluster stable for VM images we had to add all these new features
> and then fix all the bugs Lindsay/Kevin reported. We just fixed a corruption
> issue that can happen with replace-brick which will be available in 3.9.0
> and 3.8.6. The only 2 other known issues that can lead to corruptions are
> add-brick and the bug you filed, Gandalf. Krutika just 5 minutes back saw
> something that could possibly lead to the corruption for the add-brick bug.
> Is that really the root cause? We are not sure yet, we need more time.
> Without Lindsay/Kevin/David Gossage's support this workload would have been
> in much worse condition. These bugs are not easy to re-create, thus not easy
> to fix. At least that has been Krutika's experience.

Ok, but these changes should be placed in a "test" version and not marked as stable. I don't see any development release, only stable releases here. Do you want all the features? Try the "beta/rc/unstable/alpha/dev" version. Do you want the stable version without known bugs but slow on VM workloads? Use the "-stable" version.

If you release as stable, users tend to upgrade their cluster and use the newer features (that you are marking as stable). What if I upgrade a production cluster to a stable version and try an add-brick that leads to data corruption? Do I have to restore terabytes worth of data? Gluster is made for scale-out; what if my cluster was made with 500TB of VMs? Try to restore 500TB from a backup...

This is unacceptable. add-brick/replace-brick should be common "daily" operations. You should heavily check these for regressions or bugs.

> One more takeaway is to get the
> documentation right. Lack of documentation led Alex to try the worst
> possible combo for storing VMs on gluster. So we as a community failed in some
> way there as well.
>
> Krutika will be sending out VM use-case related documentation after
> the 28th of this month.
> If you have any other feedback, do let us know.

Yes, lack of updated docs or a reference architecture is a big issue.