Kingsley
2016-Jul-15 16:25 UTC
[Gluster-users] lingering <gfid:*> entries in volume heal, gluster 3.6.3
On Fri, 2016-07-15 at 21:41 +0530, Ravishankar N wrote:
> On 07/15/2016 09:32 PM, Kingsley wrote:
> > On Fri, 2016-07-15 at 21:06 +0530, Ravishankar N wrote:
> >> On 07/15/2016 08:48 PM, Kingsley wrote:
> >>> I don't have star installed so I used ls,
> >> Oops typo. I meant `stat`.
> >>> but yes they all have 2 links
> >>> to them (see below).
> >>>
> >> Everything seems to be in place for the heal to happen. Can you tailf
> >> the output of the shd logs on all nodes and manually launch gluster
> >> vol heal volname?
> >> Use DEBUG log level if you have to and examine the output for clues.
> >
> > I presume I can do that with this command:
> >
> > gluster volume set callrec diagnostics.brick-log-level DEBUG
> shd is a client process, so it is diagnostics.client-log-level. This
> would affect your mounts too.
> >
> > How can I find out what the log level is at the moment, so that I can
> > put it back afterwards?
> INFO. You can also use `gluster volume reset`.

Thanks.

> >> Also, some dumb things to check: are all the bricks really up, and
> >> is the shd connected to them?
> > All bricks are definitely up. I just created a file on a client and
> > it appeared on all 4 bricks.
> >
> > I don't know how to tell whether the shd is connected to all of them,
> > though.
> Latest messages like "connected to client-xxx" and "disconnected from
> client-xxx" in the shd logs. Just like in the mount logs.

This has revealed something. I'm now seeing lots of lines like this in
the shd log:

[2016-07-15 16:20:51.098152] D [afr-self-heald.c:516:afr_shd_index_sweep] 0-callrec-replicate-0: got entry: eaa43674-b1a3-4833-a946-de7b7121bb88
[2016-07-15 16:20:51.099346] D [client-rpc-fops.c:1523:client3_3_inodelk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
[2016-07-15 16:20:51.100683] D [client-rpc-fops.c:2686:client3_3_opendir_cbk] 0-callrec-client-2: remote operation failed: Stale file handle. Path: <gfid:eaa43674-b1a3-4833-a946-de7b7121bb88> (eaa43674-b1a3-4833-a946-de7b7121bb88)
[2016-07-15 16:20:51.101180] D [client-rpc-fops.c:1627:client3_3_entrylk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
[2016-07-15 16:20:51.101663] D [client-rpc-fops.c:1627:client3_3_entrylk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
[2016-07-15 16:20:51.102056] D [client-rpc-fops.c:1627:client3_3_entrylk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle

These lines continued to be written to the log even after I manually
launched the self heal (which reported that it had been launched
successfully). I also tried repeating that command on one of the bricks
that was giving those messages, but that made no difference.

Client 2 would correspond to the brick that had been offline, so how do
I get the shd to reconnect to that brick? I did a ps but I couldn't see
any processes with glustershd in the name, else I'd have tried sending
that one a HUP.

Cheers,
Kingsley.
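[For anyone retracing these steps, the log-level round trip discussed above would look roughly like this. A sketch only: the volume name callrec is taken from the thread, and the shd log path is assumed to be the default /var/log/glusterfs/glustershd.log.]

    # raise the client-side log level (note: this affects mounts too)
    gluster volume set callrec diagnostics.client-log-level DEBUG

    # watch the shd log on each node while re-triggering the heal
    tail -f /var/log/glusterfs/glustershd.log &
    gluster volume heal callrec

    # check shd connectivity: find the most recent connect/disconnect lines
    grep -E 'connected to|disconnected from' /var/log/glusterfs/glustershd.log | tail

    # afterwards, put the option back to its default (INFO)
    gluster volume reset callrec diagnostics.client-log-level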
Ravishankar N
2016-Jul-15 16:54 UTC
[Gluster-users] lingering <gfid:*> entries in volume heal, gluster 3.6.3
On 07/15/2016 09:55 PM, Kingsley wrote:
> This has revealed something. I'm now seeing lots of lines like this in
> the shd log:
>
> [2016-07-15 16:20:51.098152] D [afr-self-heald.c:516:afr_shd_index_sweep] 0-callrec-replicate-0: got entry: eaa43674-b1a3-4833-a946-de7b7121bb88
> [2016-07-15 16:20:51.099346] D [client-rpc-fops.c:1523:client3_3_inodelk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
> [2016-07-15 16:20:51.100683] D [client-rpc-fops.c:2686:client3_3_opendir_cbk] 0-callrec-client-2: remote operation failed: Stale file handle. Path: <gfid:eaa43674-b1a3-4833-a946-de7b7121bb88> (eaa43674-b1a3-4833-a946-de7b7121bb88)

Looks like the files are not present at all on client-2, which is why
you see these messages. Find out the file/directory names corresponding
to these gfids from one of the healthy bricks and see if they are
present on client-2 as well. If not, try accessing them from the mount.
That should create any missing entries on client-2. Then launch heal
again.

Hope this helps.
Ravi

> [2016-07-15 16:20:51.101180] D [client-rpc-fops.c:1627:client3_3_entrylk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
> [2016-07-15 16:20:51.101663] D [client-rpc-fops.c:1627:client3_3_entrylk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
> [2016-07-15 16:20:51.102056] D [client-rpc-fops.c:1627:client3_3_entrylk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
>
> These lines continued to be written to the log even after I manually
> launched the self heal (which reported that it had been launched
> successfully). I also tried repeating that command on one of the bricks
> that was giving those messages, but that made no difference.
>
> Client 2 would correspond to the brick that had been offline, so how do
> I get the shd to reconnect to that brick? I did a ps but I couldn't see
> any processes with glustershd in the name, else I'd have tried sending
> that one a HUP.
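[For reference, a sketch of the gfid-to-path lookup Ravi suggests. The brick path /export/brick1/callrec is hypothetical; substitute your own. On a brick, each regular file's gfid appears as a hard link under .glusterfs (directories get a symlink instead), so the real name can be recovered with find -samefile:]

    # the gfid file lives at .glusterfs/<first 2 hex chars>/<next 2>/<gfid>
    BRICK=/export/brick1/callrec
    GFID=eaa43674-b1a3-4833-a946-de7b7121bb88
    ls -l "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"

    # regular file (link count > 1): find its real path on the brick
    find "$BRICK" -samefile "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID" \
        -not -path '*/.glusterfs/*'

    # directory: the .glusterfs entry is a symlink into the parent's gfid dir
    readlink "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"

[Once the name is known, stat-ing that path through a client mount should recreate the missing entry on the brick that had been offline, after which `gluster volume heal callrec` can be run again, as described above.]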