Alex Florescu
2012-Apr-11 11:00 UTC
[Gluster-users] Recovering out of sync nodes from input/output error
Hello,

We use gluster in our production environment and recently encountered an unrecoverable error which was solved only by deleting the existing volume and the locally stored files and recreating everything from scratch. I am now playing with a test environment which almost mirrors the prod and I can always reproduce the problem.

We have websites on two servers which use gluster for common usage files. We also use DNS round robin for request balancing (this is a key element in the scenario).

Setup: Two servers running Gentoo 2.0.3, kernel 3.0.6, glusterfs 3.2.5

Gluster commands:
gluster volume create vol-replication replica 2 transport tcp 10.0.2.14:/local 10.0.2.15:/local
gluster volume start vol-replication
gluster volume set vol-replication network.ping-timeout 1
node1 (10.0.2.14): mount -t glusterfs 10.0.2.14:/vol-replication /a
node2 (10.0.2.15): mount -t glusterfs 10.0.2.15:/vol-replication /a

Now assume that connectivity between the two nodes has failed, but they can still be accessed from the outside world and files can be written on them through Apache.
Request 1 -> 10.0.2.14 -> creates file howareyou
Request 2 -> 10.0.2.15 -> creates file hello

At some point, connectivity between the two nodes recovers and disaster strikes:
ls /a
ls: cannot access /a: Input/output error

Simulation follows:

step 1
node1:
iptables -I INPUT 1 -s 10.0.2.15 -j DROP (connectivity loss simulation)
touch /a/howareyou

node2:
touch /a/hello

step 2
node1:
iptables -D INPUT 1 (connectivity recovery)
ls /a
ls: cannot access /a: Input/output error

node2:
ls /a
ls: cannot access /a: Input/output error

The only way to recover from this was to delete the offending files. This was easy to do on the test environment because there were only two files involved, but on the prod environment we had many more, and I managed to recover only after deleting the gluster volume and the local content, including the local storage directory itself! Nothing else of what I tried (stopping the volume, recreating the volume, emptying the local storage directory, remounting, restarting gluster) worked.

Any hint on how one could recover from this sort of situation?

Thank you.
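
(For reference, the same simulation can be driven end to end from a third admin host with root ssh access to both nodes. This is only a sketch of the steps above, reusing the IPs and paths from this setup; it is not a tested script.)

ssh 10.0.2.14 "iptables -I INPUT 1 -s 10.0.2.15 -j DROP"   # simulate connectivity loss on node1
ssh 10.0.2.14 "touch /a/howareyou"                         # write seen only by node1
ssh 10.0.2.15 "touch /a/hello"                             # write seen only by node2
ssh 10.0.2.14 "iptables -D INPUT 1"                        # restore connectivity
ssh 10.0.2.14 "ls /a"    # ls: cannot access /a: Input/output error
ssh 10.0.2.15 "ls /a"    # ls: cannot access /a: Input/output error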
Robert Hajime Lanning
2012-Apr-11 22:45 UTC
[Gluster-users] Recovering out of sync nodes from input/output error
On 04/11/12 04:00, Alex Florescu wrote:
> I am now playing with a test environment which almost mirrors the prod
> and I can always reproduce the problem.
> We have websites on two servers which use gluster for common usage
> files. We also use DNS round robin for request balancing (this is a key
> element in the scenario).
>
> Setup: Two servers running Gentoo 2.0.3, kernel 3.0.6, glusterfs 3.2.5
> Gluster commands:
> gluster volume create vol-replication replica 2 transport tcp
> 10.0.2.14:/local 10.0.2.15:/local
> gluster volume start vol-replication
> gluster volume set vol-replication network.ping-timeout 1
> node1 (10.0.2.14): mount -t glusterfs 10.0.2.14:/vol-replication /a
> node2 (10.0.2.15): mount -t glusterfs 10.0.2.15:/vol-replication /a
>
> Now assume that connectivity between the two nodes has failed, but they
> can still be accessed from the outside world and files can be written on
> them through Apache.
> Request 1 -> 10.0.2.14 -> creates file howareyou
> Request 2 -> 10.0.2.15 -> creates file hello

So, now you have a "split-brain" problem.

> At some point, connectivity between the two nodes recovers and disaster
> strikes:
> ls /a
> ls: cannot access /a: Input/output error

Which directory is the "source of truth"? Did "howareyou" exist on 10.0.2.15 and get deleted during the outage, or is it a new file? And vice versa for "hello".

So, when you look at the directory itself, which state is correct? Gluster does not have a per-brick transaction log to sync across.

> The only way to recover from this was to delete the offending files. This
> was easy to do on the test environment because there were only two files
> involved, but on the prod environment we had many more, and I managed to
> recover only after deleting the gluster volume and the local content,
> including the local storage directory itself! Nothing else of what I
> tried (stopping the volume, recreating the volume, emptying the local
> storage directory, remounting, restarting gluster) worked.
>
> Any hint on how one could recover from this sort of situation?
> Thank you.

Tar replica1 and untar on replica2. Then delete everything on replica1. Then self-heal should take care of the rest.

--
Mr. Flibble
King of the Potato People
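
A rough sketch of that tar-and-wipe sequence, run from node1, assuming the brick directory is /local on both nodes (as in the original setup) and that root ssh from node1 to node2 works; this is an illustration of the suggestion above, not a verified procedure:

# Copy replica1's brick contents onto replica2's brick. Plain tar does not
# carry gluster's extended attributes along; note that any same-named files
# on replica2 are overwritten with replica1's copies.
cd /local && tar cf - . | ssh 10.0.2.15 "cd /local && tar xf -"
# Empty replica1's brick so self-heal repopulates it from replica2.
rm -rf /local/*
# Trigger self-heal from a client mount by stat-ing every path
# (glusterfs 3.2.x heals on access; newer releases have explicit heal commands).
find /a -print0 | xargs -0 stat > /dev/null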
Jeff Darcy
2012-Apr-12 12:49 UTC
[Gluster-users] Recovering out of sync nodes from input/output error
On 04/11/2012 07:00 AM, Alex Florescu wrote:
> Simulation follows:
>
> step 1
> node1:
> iptables -I INPUT 1 -s 10.0.2.15 -j DROP (connectivity loss simulation)
> touch /a/howareyou
>
> node2:
> touch /a/hello
>
> step 2
> node1:
> iptables -D INPUT 1 (connectivity recovery)
> ls /a
> ls: cannot access /a: Input/output error
>
> node2:
> ls /a
> ls: cannot access /a: Input/output error

I was able to reproduce this on my own setup using packages built from git, which was a bit of a surprise, TBH. I'll look into it, but here are some observations that might suggest workarounds.

(1) To a first approximation, it should be safe to "merge" directory contents despite there being a split-brain problem, by healing any file that exists on only one brick from there to its peer(s). This contrasts with the case for file contents, where - as Robert points out - we can't determine the correct thing to do and would risk overwriting data. Directory entries differ from file contents in a small but important way: they're sets, not arrays. If something's not in the set, there's no danger that adding it will overwrite anything.

(2) That said, the case you've created is indistinguishable from the case where "hello" and "howareyou" used to exist on both bricks and each side *deleted* one while they couldn't communicate. Unconditionally recreating the files would effectively undo those deletes, which many would consider an error as serious as overwriting data. It would not be valid for such merge behavior to kick in unconditionally. At the very least, there should be a configuration option for it.

(3) The reason you continue to get I/O errors is probably that the xattrs on the *parent directory* still indicate pending operations on both sides. You can verify this with the following command on each brick:

getfattr -d -e hex -n trusted.glusterfs.dht /a

The format of this value is described here:

http://hekafs.org/index.php/2011/04/glusterfs-extended-attributes/

If the result is non-zero (most likely in the last four-byte integer, which indicates a directory-entry operation), then that confirms our theory. It should be safe for the self-heal code to clear these counts if (and only if) the directories are checked and found identical. In fact, I think we already do this. Thus, manual copying of files followed by self-heal on the parent directory should make the errors go away. I encourage you to try that while I go look at the code.
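
As a concrete illustration of the check in (3), run against the brick directory (/local in this setup) rather than the client mount, the inspection might look like the following. The trusted.afr.* changelog attribute shown is assumed to be the relevant one here, and the sample value is invented for illustration, not captured from a real system.

# dump every extended attribute on the brick's copy of the directory
getfattr -d -m . -e hex /local
# a hypothetical non-zero pending count on one replica might look like:
# trusted.afr.vol-replication-client-1=0x000000000000000000000001
# (three 4-byte counters: data, metadata, entry operations; a non-zero
#  last counter matches the pending directory-entry case described above)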
Rodrigo Severo
2012-Apr-13 13:05 UTC
[Gluster-users] Recovering out of sync nodes from input/output error
On Fri, Apr 13, 2012 at 9:10 AM, Alex Florescu <alex.florescu at tripsolutions.co.uk> wrote:
>
> On Fri, Apr 13, 2012 at 2:23 PM, Robert Hajime Lanning wrote:
>
>> How about:
>> getfattr -d -n trusted.gluster.dht /local
>>
>> He was asking for the attribute on the directory, not the file.
>
> Sorry, I ran that too but forgot to include it. It's the same.
>
> getfattr -d -n trusted.gluster.dht /local
> /local: trusted.gluster.dht: No such attribute
> getfattr -d /local
> <blank>

You are using -d and -n on the same getfattr. That's wrong, AFAICT. Try:

getfattr -d -m . -e hex -h BRICK/PATH/TO/FILE/OR/DIRECTORY

You should get a full list of extended attributes related to the file/directory.

Rodrigo
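
Applied to this thread's setup, that form would be run against the brick directory (and, if needed, individual entries) on each node, for example; the attribute names and values will differ per node:

getfattr -d -m . -e hex -h /local
getfattr -d -m . -e hex -h /local/hello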