Anirban Ghoshal
2014-Feb-04 21:27 UTC
[Gluster-users] On breaking the connection between replicated volumes, certain files return -ENOTCONN
Hi everyone,

Here's a strange issue. I am using glusterfs 3.4.0 alpha (we need to move to a stable version ASAP), but I am reporting this on the off chance that it might be interesting to somebody on the glusterfs development team. Please excuse the sheer length of this mail; I am new to browsing such a large code base and not very good at presenting my ideas clearly.

Here's the set of observations:

1. You have a replica 2 volume (testvol) on server1 and server2. Assume that on each server it is also locally mounted via mount.glusterfs at /testvol.
2. You have a large number of soft-linked files within the volume.
3. You check heal info (all its facets) to ensure not a single file is out of sync (also verify md5sums or similar, if possible).
4. You abruptly take down the ethernet device over which the servers are connected (ip link set <eth-dev> down).
5. On one of the servers (say, server1 for definiteness), if you do an 'ls -l', readlink returns 'Transport endpoint is not connected'.
6. The error resolves all by itself once you bring the eth link back up.

Here's some additional detail:

7. The error is intermittent, and not all soft-linked files have the issue.
8. If you take a directory containing soft-linked files and do an 'ls -l' _on_the_directory_, several of the links fail, like so:

server1$ ls -l /testvol/somedir/bin/
ls: cannot read symbolic link /testvol/somedir/bin/reset: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/bzless: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/i386: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/kill: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/linux32: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/linux64: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/logger: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/x86_64: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/python2: Transport endpoint is not connected

9. If, however, you take a faulty soft-link and do an 'ls -l' on it directly, it rights itself immediately:

server1$ ls -l /testvol/somedir/bin/x86_64
lrwxrwxrwx 1 root root 7 May  7 23:11 /testvol/somedir/bin/x86_64 -> setarch
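Incidentally, the symptom is visible outside of ls as well. Here is a tiny stand-alone program (my own sketch, not something taken from the systems above; the path is just the one from observations 8 and 9) which, run against an affected link on the FUSE mount, should fail with exactly the same error, since ls is simply relaying the errno it gets back from readlink(2):

        #include <errno.h>
        #include <limits.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main (void)
        {
                char    target[PATH_MAX];
                ssize_t n;

                /* example path only -- any affected soft-link on the mount */
                n = readlink ("/testvol/somedir/bin/x86_64", target,
                              sizeof (target) - 1);
                if (n < 0) {
                        /* on an affected link this prints:
                         * readlink: Transport endpoint is not connected */
                        fprintf (stderr, "readlink: %s\n", strerror (errno));
                        return 1;
                }
                target[n] = '\0';
                printf ("-> %s\n", target);
                return 0;
        }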
I tried raising the client log level to 'trace'. Here's what I saw.

Upon READLINK failure (ls -l /testvol/somedir/bin/):

[2010-05-09 01:13:28.140265] T [fuse-bridge.c:2453:fuse_readdir_cbk] 0-glusterfs-fuse: 2783484: READDIR => 23/4096,1380
[2010-05-09 01:13:28.140444] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 45
[2010-05-09 01:13:28.140477] T [fuse-bridge.c:708:fuse_getattr_resume] 0-glusterfs-fuse: 2783485: GETATTR 140299577689176 (/testvol/somedir/bin)
[2010-05-09 01:13:28.140618] T [fuse-bridge.c:641:fuse_attr_cbk] 0-glusterfs-fuse: 2783485: STAT() /testvol/somedir/bin => -5626802993936595428
[2010-05-09 01:13:28.140722] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 52
[2010-05-09 01:13:28.140737] T [fuse-bridge.c:506:fuse_lookup_resume] 0-glusterfs-fuse: 2783486: LOOKUP /testvol/somedir/bin/x86_64(025d1c57-865f-4f1f-bc95-96ddcef3dc03)
[2010-05-09 01:13:28.140851] T [fuse-bridge.c:376:fuse_entry_cbk] 0-glusterfs-fuse: 2783486: LOOKUP() /testvol/somedir/bin/x86_64 => -4857810743645185021
[2010-05-09 01:13:28.140954] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 52
[2010-05-09 01:13:28.140975] T [fuse-bridge.c:1296:fuse_readlink_resume] 0-glusterfs-fuse: 2783487 READLINK /testvol/somedir/bin/x86_64/025d1c57-865f-4f1f-bc95-96ddcef3dc03
[2010-05-09 01:13:28.141090] D [afr-common.c:760:afr_get_call_child] 0-_testvol-replicate-0: Returning -107, call_child: -1, last_index: -1
[2010-05-09 01:13:28.141120] W [fuse-bridge.c:1271:fuse_readlink_cbk] 0-glusterfs-fuse: 2783487: /testvol/somedir/bin/x86_64 => -1 (Transport endpoint is not connected)
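As an aside (my own note, not something in the trace): the -107 that afr_get_call_child reports is simply -ENOTCONN, i.e. the very error that ls ends up printing. A trivial check, assuming any Linux machine:

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>

        int main (void)
        {
                /* prints: ENOTCONN = 107 (Transport endpoint is not connected) */
                printf ("ENOTCONN = %d (%s)\n", ENOTCONN, strerror (ENOTCONN));
                return 0;
        }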
Upon successful readlink (ls -l /testvol/somedir/bin/x86_64):

[2010-05-09 01:13:37.717904] T [fuse-bridge.c:376:fuse_entry_cbk] 0-glusterfs-fuse: 2790073: LOOKUP() /testvol/somedir/bin => -5626802993936595428
[2010-05-09 01:13:37.718070] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 52
[2010-05-09 01:13:37.718127] T [fuse-bridge.c:506:fuse_lookup_resume] 0-glusterfs-fuse: 2790074: LOOKUP /testvol/somedir/bin/x86_64(025d1c57-865f-4f1f-bc95-96ddcef3dc03)
[2010-05-09 01:13:37.718306] D [afr-common.c:131:afr_lookup_xattr_req_prepare] 0-_testvol-replicate-0: /testvol/somedir/bin/x86_64: failed to get the gfid from dict
[2010-05-09 01:13:37.718355] T [rpc-clnt.c:1301:rpc_clnt_record] 0-_testvol-client-1: Auth Info: pid: 3343, uid: 0, gid: 0, owner: 0000000000000000
[2010-05-09 01:13:37.718383] T [rpc-clnt.c:1181:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 464, payload: 368, rpc hdr: 96
[2010-05-09 01:13:37.718459] T [rpc-clnt.c:1498:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x1527955x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (_testvol-client-1)
[2010-05-09 01:13:37.718775] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-_testvol-client-1: received rpc message (RPC XID: 0x1527955x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) from rpc-transport (_testvol-client-1)
[2010-05-09 01:13:37.718848] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-_testvol-replicate-0: pending_matrix: [ 0 0 ]
[2010-05-09 01:13:37.718884] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-_testvol-replicate-0: pending_matrix: [ 1 0 ]
[2010-05-09 01:13:37.718893] D [afr-self-heal-common.c:887:afr_mark_sources] 0-_testvol-replicate-0: Number of sources: 1
[2010-05-09 01:13:37.718903] D [afr-self-heal-data.c:814:afr_lookup_select_read_child_by_txn_type] 0-_testvol-replicate-0: returning read_child: 1
[2010-05-09 01:13:37.718916] D [afr-common.c:1397:afr_lookup_select_read_child] 0-_testvol-replicate-0: Source selected as 1 for /testvol/somedir/bin/x86_64
[2010-05-09 01:13:37.718927] D [afr-common.c:1134:afr_lookup_build_response_params] 0-_testvol-replicate-0: Building lookup response from 1
[2010-05-09 01:13:37.718954] D [afr-common.c:1749:afr_lookup_perform_self_heal] 0-_testvol-replicate-0: Only 1 child up - do not attempt to detect self heal
[2010-05-09 01:13:37.718970] T [io-cache.c:224:ioc_lookup_cbk] 0-_testvol-io-cache: locked inode(0x7f9a057d3fe0)
[2010-05-09 01:13:37.718981] T [io-cache.c:233:ioc_lookup_cbk] 0-_testvol-io-cache: unlocked inode(0x7f9a057d3fe0)
[2010-05-09 01:13:37.718991] T [io-cache.c:128:ioc_inode_flush] 0-_testvol-io-cache: locked inode(0x7f9a057d3fe0)
[2010-05-09 01:13:37.719001] T [io-cache.c:132:ioc_inode_flush] 0-_testvol-io-cache: unlocked inode(0x7f9a057d3fe0)
[2010-05-09 01:13:37.719010] T [io-cache.c:242:ioc_lookup_cbk] 0-_testvol-io-cache: locked table(0x668ec0)
[2010-05-09 01:13:37.719020] T [io-cache.c:247:ioc_lookup_cbk] 0-_testvol-io-cache: unlocked table(0x668ec0)
[2010-05-09 01:13:37.719050] T [fuse-bridge.c:376:fuse_entry_cbk] 0-glusterfs-fuse: 2790074: LOOKUP() /testvol/somedir/bin/x86_64 => -4857810743645185021
[2010-05-09 01:13:37.719239] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 52
[2010-05-09 01:13:37.719291] T [fuse-bridge.c:1296:fuse_readlink_resume] 0-glusterfs-fuse: 2790075 READLINK /testvol/somedir/bin/x86_64/025d1c57-865f-4f1f-bc95-96ddcef3dc03
[2010-05-09 01:13:37.719426] D [afr-common.c:760:afr_get_call_child] 0-_testvol-replicate-0: Returning 0, call_child: 1, last_index: -1
[2010-05-09 01:13:37.719468] T [rpc-clnt.c:1301:rpc_clnt_record] 0-_testvol-client-1: Auth Info: pid: 3343, uid: 0, gid: 0, owner: 0000000000000000
[2010-05-09 01:13:37.719485] T [rpc-clnt.c:1181:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 120, payload: 24, rpc hdr: 96
[2010-05-09 01:13:37.719561] T [rpc-clnt.c:1498:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x1527956x Program: GlusterFS 3.3, ProgVers: 330, Proc: 2) to rpc-transport (_testvol-client-1)
[2010-05-09 01:13:37.719728] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-_testvol-client-1: received rpc message (RPC XID: 0x1527956x Program: GlusterFS 3.3, ProgVers: 330, Proc: 2) from rpc-transport (_testvol-client-1)
[2010-05-09 01:13:37.719780] T [fuse-bridge.c:1265:fuse_readlink_cbk] 0-glusterfs-fuse: 2790075: /testvol/somedir/bin/x86_64 => setarch

I cannot quite figure out how the whole thing works, but it looks like in the failure case we end up with the wrong read_child. First, take a look at fuse_lookup_cbk, which I reproduce here:

static int
fuse_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 int32_t op_ret, int32_t op_errno,
                 inode_t *inode, struct iatt *stat, dict_t *dict,
                 struct iatt *postparent)
{
        fuse_state_t    *state = NULL;
        call_frame_t    *prev = NULL;
        inode_table_t   *itable = NULL;

        state = frame->root->state;
        prev  = cookie;

        if (op_ret == -1 && state->is_revalidate == 1) {
                /* revalidate failed: drop the stale inode and retry the
                 * lookup with a fresh one */
                itable = state->itable;
                inode_unref (state->loc.inode);
                state->loc.inode = inode_new (itable);
                state->is_revalidate = 2;
                if (uuid_is_null (state->gfid))
                        uuid_generate (state->gfid);
                fuse_gfid_set (state);

                STACK_WIND (frame, fuse_lookup_cbk,
                            prev->this, prev->this->fops->lookup,
                            &state->loc, state->xdata);
                return 0;
        }

        fuse_entry_cbk (frame, cookie, this, op_ret, op_errno, inode, stat,
                        dict);
        return 0;
}
One can see that a resolve failure causing op_ret to be -1 (note: from the logs, is_revalidate == 1 in both cases) would cause the mount program to refresh the inode context. In fact, this is where the flow diverges between the success and the failure scenarios. I am inclined to think that at this point a soft-link on which readlink fails (on the server1 mount) has its read-child set to the afr client for server2, so in theory a refresh of the context would be necessary.

Now, if we take a look at the actual failure log:

[2010-05-09 01:13:28.141090] D [afr-common.c:760:afr_get_call_child] 0-_testvol-replicate-0: Returning -107, call_child: -1, last_index: -1
[2010-05-09 01:13:28.141120] W [fuse-bridge.c:1271:fuse_readlink_cbk] 0-glusterfs-fuse: 2783487: /testvol/somedir/bin/x86_64 => -1 (Transport endpoint is not connected)

and then at the code snippet below from afr_get_call_child(), it looks like there might be a problem (or maybe I am just not reading it right)?

        if (child_up[read_child]) {
                *call_child = read_child;
        } else {
                for (i = 0; i < priv->child_count; i++) {
                        if (fresh_children[i] == -1)
                                break;
                        if (child_up[fresh_children[i]]) {
                                *call_child = fresh_children[i];
                                ret = 0;
                                break;
                        }
                }
                if (*call_child == -1) {
                        ret = -ENOTCONN;
                        goto out;
                }
                *last_index = i;
        }
out:
        gf_log (this->name, GF_LOG_DEBUG, "Returning %d, call_child: %d, "
                "last_index: %d", ret, *call_child, *last_index);
        return ret;

I would suppose child_up[read_child] equals 0 here, in which case we would loop over all the fresh children. This is where I am confused: if we read a link on server1, and server1 lets me read so many other files correctly, then surely all subvolumes on server1 MUST be active? But then, we return -ENOTCONN only when it is discovered that not one of the subvolumes is active!

It would be great if somebody could confirm my ideas for me. Also, should I open a bug for this?

Thanks a lot for your answers!
Anirban

P.s. And yes, umount/mount does resolve the problem as well.
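P.p.s. To make the suspicion above concrete, here is a stand-alone toy model of the selection logic (my own simplification, NOT GlusterFS code; names and values are purely illustrative). The point: if the cached read_child still points at the brick that went away and fresh_children was never repopulated, the loop never reaches the brick that is actually up, and we fall through to -ENOTCONN even though one child is perfectly reachable; once the context is refreshed, the same logic happily picks the live child.

        #include <errno.h>
        #include <stdio.h>

        #define CHILD_COUNT 2

        /* Mimics the quoted snippet: prefer read_child if it is up, otherwise
         * walk fresh_children (terminated by -1); if nothing usable is found,
         * return -ENOTCONN. */
        static int pick_call_child (const int child_up[CHILD_COUNT],
                                    int read_child,
                                    const int fresh_children[CHILD_COUNT])
        {
                int i = 0;

                if (child_up[read_child])
                        return read_child;

                for (i = 0; i < CHILD_COUNT; i++) {
                        if (fresh_children[i] == -1)
                                break;
                        if (child_up[fresh_children[i]])
                                return fresh_children[i];
                }
                return -ENOTCONN;
        }

        int main (void)
        {
                /* child 0 = brick reachable from server1 (up),
                 * child 1 = brick on server2 (eth link down) */
                int child_up[CHILD_COUNT] = { 1, 0 };

                /* Stale inode context: read_child still points at the dead
                 * brick and fresh_children was never repopulated. */
                int stale_fresh[CHILD_COUNT] = { -1, -1 };
                int ret = pick_call_child (child_up, 1, stale_fresh);
                printf ("stale context  : %s\n",
                        ret < 0 ? "Transport endpoint is not connected" : "ok");

                /* After a fresh lookup repopulates the context, child 0 is
                 * found and the readlink can be served. */
                int fresh[CHILD_COUNT] = { 0, -1 };
                ret = pick_call_child (child_up, 1, fresh);
                printf ("refreshed ctxt : served by child %d\n", ret);

                return 0;
        }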