Alex Florescu
2012-Apr-11 11:00 UTC
[Gluster-users] Recovering out of sync nodes from input/output error
Hello,

We use gluster in our production environment and recently encountered an unrecoverable error which was solved only by deleting the existing volume and the locally stored files and recreating everything from scratch. I am now playing with a test environment which almost mirrors the prod and I can always reproduce the problem.

We have websites on two servers which use gluster for common usage files. We also use DNS round robin for request balancing (this is a key element in the scenario).

Setup: Two servers running Gentoo 2.0.3, kernel 3.0.6, glusterfs 3.2.5

Gluster commands:
gluster volume create vol-replication replica 2 transport tcp 10.0.2.14:/local 10.0.2.15:/local
gluster volume start vol-replication
gluster volume set vol-replication network.ping-timeout 1
node1 (10.0.2.14): mount -t glusterfs 10.0.2.14:/vol-replication /a
node2 (10.0.2.15): mount -t glusterfs 10.0.2.15:/vol-replication /a

Now assume that connectivity between the two nodes has failed, but they can still be accessed from the outside world and files can be written on them through Apache.
Request 1 -> 10.0.2.14 -> creates file howareyou
Request 2 -> 10.0.2.15 -> creates file hello

At some point, connectivity between the two nodes recovers and disaster strikes:
ls /a
ls: cannot access /a: Input/output error

Simulation follows:

step 1
node1:
iptables -I INPUT 1 -s 10.0.2.15 -j DROP (connectivity loss simulation)
touch /a/howareyou

node2:
touch /a/hello

step 2
node1:
iptables -D INPUT 1 (connectivity recovery)
ls /a
ls: cannot access /a: Input/output error

node2:
ls /a
ls: cannot access /a: Input/output error

The only way to recover from this was to delete the offending files. This was easy to do on the test environment because there were only two files involved, but on the prod environment we had many more, and I managed to recover only after deleting the gluster volume and the local content, including the local storage directory itself! Nothing else of what I tried (stopping the volume, recreating the volume, emptying the local storage directory, remounting, restarting gluster) worked.

Any hint on how one could recover from this sort of situation?

Thank you.
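
(For reference, the same simulation can be driven end to end from a third admin host with root ssh access to both nodes. This is only a sketch of the steps above, reusing the IPs and paths from this setup; it is not a tested script.)

ssh 10.0.2.14 "iptables -I INPUT 1 -s 10.0.2.15 -j DROP"   # simulate connectivity loss on node1
ssh 10.0.2.14 "touch /a/howareyou"                         # write seen only by node1
ssh 10.0.2.15 "touch /a/hello"                             # write seen only by node2
ssh 10.0.2.14 "iptables -D INPUT 1"                        # restore connectivity
ssh 10.0.2.14 "ls /a"    # ls: cannot access /a: Input/output error
ssh 10.0.2.15 "ls /a"    # ls: cannot access /a: Input/output error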
Robert Hajime Lanning
2012-Apr-11 22:45 UTC
[Gluster-users] Recovering out of sync nodes from input/output error
On 04/11/12 04:00, Alex Florescu wrote:
> I am now playing with a test environment which almost mirrors the prod
> and I can always reproduce the problem.
> We have websites on two servers which use gluster for common usage
> files. We also use DNS round robin for request balancing (this is a key
> element in the scenario).
>
> Setup: Two servers running Gentoo 2.0.3, kernel 3.0.6, glusterfs 3.2.5
> Gluster commands:
> gluster volume create vol-replication replica 2 transport tcp
> 10.0.2.14:/local 10.0.2.15:/local
> gluster volume start vol-replication
> gluster volume set vol-replication network.ping-timeout 1
> node1 (10.0.2.14): mount -t glusterfs 10.0.2.14:/vol-replication /a
> node2 (10.0.2.15): mount -t glusterfs 10.0.2.15:/vol-replication /a
>
> Now assume that connectivity between the two nodes has failed, but they
> can still be accessed from the outside world and files can be written on
> them through Apache.
> Request 1 -> 10.0.2.14 -> creates file howareyou
> Request 2 -> 10.0.2.15 -> creates file hello

So, now you have a "split-brain" problem.

> At some point, connectivity between the two nodes recovers and disaster
> strikes:
> ls /a
> ls: cannot access /a: Input/output error

Which directory is the "source of truth"? Did "howareyou" exist on 10.0.2.15 and get deleted during the outage, or is it a new file? And vice versa for "hello".

So, when you look at the directory itself, which state is correct? Gluster does not have a per-brick transaction log to sync across.

> The only way to recover from this was to delete the offending files. This
> was easy to do on the test environment because there were only two files
> involved, but on the prod environment we had many more, and I managed to
> recover only after deleting the gluster volume and the local content,
> including the local storage directory itself! Nothing else of what I
> tried (stopping the volume, recreating the volume, emptying the local
> storage directory, remounting, restarting gluster) worked.
>
> Any hint on how one could recover from this sort of situation?
> Thank you.

Tar replica1 and untar on replica2. Then delete everything on replica1. Then self-heal should take care of the rest.

--
Mr. Flibble
King of the Potato People
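
A rough sketch of that tar-and-wipe sequence, run from node1, assuming the brick directory is /local on both nodes (as in the original setup) and that root ssh from node1 to node2 works; this is an illustration of the suggestion above, not a verified procedure:

# Copy replica1's brick contents onto replica2's brick. Plain tar does not
# carry gluster's extended attributes along; note that any same-named files
# on replica2 are overwritten with replica1's copies.
cd /local && tar cf - . | ssh 10.0.2.15 "cd /local && tar xf -"
# Empty replica1's brick so self-heal repopulates it from replica2.
rm -rf /local/*
# Trigger self-heal from a client mount by stat-ing every path
# (glusterfs 3.2.x heals on access; newer releases have explicit heal commands).
find /a -print0 | xargs -0 stat > /dev/null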
Jeff Darcy
2012-Apr-12 12:49 UTC
[Gluster-users] Recovering out of sync nodes from input/output error
On 04/11/2012 07:00 AM, Alex Florescu wrote:
> Simulation follows:
>
> step 1
> node1:
> iptables -I INPUT 1 -s 10.0.2.15 -j DROP (connectivity loss simulation)
> touch /a/howareyou
>
> node2:
> touch /a/hello
>
> step 2
> node1:
> iptables -D INPUT 1 (connectivity recovery)
> ls /a
> ls: cannot access /a: Input/output error
>
> node2:
> ls /a
> ls: cannot access /a: Input/output error

I was able to reproduce this on my own setup using packages built from git, which was a bit of a surprise, TBH. I'll look into it, but here are some observations that might suggest workarounds.

(1) To a first approximation, it should be safe to "merge" directory contents despite there being a split-brain problem, by healing any file that exists on only one brick from there to its peer(s). This contrasts with the case for file contents, where - as Robert points out - we can't determine the correct thing to do and would risk overwriting data. Directory entries differ from file contents in a small but important way: they're sets, not arrays. If something's not in the set, there's no danger that adding it will overwrite anything.

(2) That said, the case you've created is indistinguishable from the case where "hello" and "howareyou" used to exist on both bricks and each side *deleted* one while they couldn't communicate. Unconditionally recreating the files would effectively undo those deletes, which many would consider an error as serious as overwriting data. It would not be valid for such merge behavior to kick in unconditionally. At the very least, there should be a configuration option for it.

(3) The reason you continue to get I/O errors is probably that the xattrs on the *parent directory* still indicate pending operations on both sides. You can verify this with the following command on each brick:

getfattr -d -e hex -n trusted.glusterfs.dht /a

The format of this value is described here:

http://hekafs.org/index.php/2011/04/glusterfs-extended-attributes/

If the result is non-zero (most likely in the last four-byte integer, which indicates a directory-entry operation), then that confirms our theory. It should be safe for the self-heal code to clear these counts if (and only if) the directories are checked and found identical. In fact, I think we already do this. Thus, manual copying of files followed by self-heal on the parent directory should make the errors go away. I encourage you to try that while I go look at the code.
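
As a concrete illustration of the check in (3), run against the brick directory (/local in this setup) rather than the client mount, the inspection might look like the following. The trusted.afr.* changelog attribute shown is assumed to be the relevant one here, and the sample value is invented for illustration, not captured from a real system.

# dump every extended attribute on the brick's copy of the directory
getfattr -d -m . -e hex /local
# a hypothetical non-zero pending count on one replica might look like:
# trusted.afr.vol-replication-client-1=0x000000000000000000000001
# (three 4-byte counters: data, metadata, entry operations; a non-zero
#  last counter matches the pending directory-entry case described above)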
Rodrigo Severo
2012-Apr-13 13:05 UTC
[Gluster-users] Recovering out of sync nodes from input/output error
On Fri, Apr 13, 2012 at 9:10 AM, Alex Florescu <alex.florescu at tripsolutions.co.uk> wrote:
>
> On Fri, Apr 13, 2012 at 2:23 PM, Robert Hajime Lanning wrote:
>
>> How about:
>> getfattr -d -n trusted.gluster.dht /local
>>
>> He was asking for the attribute on the directory, not the file.
>
> Sorry, I ran that too but forgot to include it. It's the same.
>
> getfattr -d -n trusted.gluster.dht /local
> /local: trusted.gluster.dht: No such attribute
> getfattr -d /local
> <blank>

You are using -d and -n on the same getfattr. That's wrong, AFAICT. Try:

getfattr -d -m . -e hex -h BRICK/PATH/TO/FILE/OR/DIRECTORY

You should get a full list of extended attributes related to the file/directory.

Rodrigo
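
Applied to this thread's setup, that form would be run against the brick directory (and, if needed, individual entries) on each node, for example; the attribute names and values will differ per node:

getfattr -d -m . -e hex -h /local
getfattr -d -m . -e hex -h /local/hello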