Zhou, Cynthia (NSB - CN/Hangzhou)
2018-Oct-10 08:52 UTC
[Gluster-users] When there is a dangling entry (without gfid) in only one brick dir, glusterfs heal info keeps showing the entry, and glustershd cannot actually remove this entry from the brick
Hi glusterfs experts,

I have hit a problem in my test bed (3 bricks on 3 sn nodes): "/" always appears in the gluster v heal info output. In my ftest + reboot-sn-nodes-randomly test, the gluster v heal info output keeps showing the entry "/" for hours, and running touch or ls on /mnt/mstate does not help.

[root@sn-0:/mnt/bricks/mstate/brick]
# gluster v heal mstate info
Brick sn-0.local:/mnt/bricks/mstate/brick
/
Status: Connected
Number of entries: 1

Brick sn-2.local:/mnt/bricks/mstate/brick
/
Status: Connected
Number of entries: 1

Brick sn-1.local:/mnt/bricks/mstate/brick
/
Status: Connected
Number of entries: 1

In the sn glustershd.log I find the following prints:

[2018-10-10 08:13:00.005250] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-mstate-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from mstate-client-0
[2018-10-10 08:13:00.006077] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-mstate-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2018-10-10 08:13:00.011599] I [MSGID: 108026] [afr-self-heal-entry.c:887:afr_selfheal_entry_do] 0-mstate-replicate-0: performing entry selfheal on 00000000-0000-0000-0000-000000000001
[2018-10-10 08:16:28.722059] W [MSGID: 108015] [afr-self-heal-entry.c:47:afr_selfheal_entry_delete] 0-mstate-replicate-0: expunging dir 00000000-0000-0000-0000-000000000001/fstest_76f272545249be5d71359f06962e069b (00000000-0000-0000-0000-000000000000) on mstate-client-0
[2018-10-10 08:16:28.722975] W [MSGID: 114031] [client-rpc-fops.c:670:client3_3_rmdir_cbk] 0-mstate-client-0: remote operation failed [No such file or directory]

When I check the environment, I find that fstest_76f272545249be5d71359f06962e069b exists only on the sn-0 brick, and getfattr on it returns nothing!

[root@sn-0:/mnt/bricks/mstate/brick]
# getfattr -m . -d -e hex fstest_76f272545249be5d71359f06962e069b
//return is empty output

[root@sn-0:/mnt/bricks/mstate/brick]
# getfattr -m . -d -e hex .
# file: .
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.mstate-client-1=0x000000000000000000000000
trusted.afr.mstate-client-2=0x0000000000000000000002a7
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0xbf7aad0e4ce44196aa9b0a33928ea2ff

[root@sn-1:/root]
# stat /mnt/bricks/mstate/brick/fstest_76f272545249be5d71359f06962e069b
stat: cannot stat '/mnt/bricks/mstate/brick/fstest_76f272545249be5d71359f06962e069b': No such file or directory

[root@sn-1:/mnt/bricks/mstate/brick]
# getfattr -m . -d -e hex .
# file: .
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.mstate-client-0=0x000000000000000000000006
trusted.afr.mstate-client-1=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0xbf7aad0e4ce44196aa9b0a33928ea2ff

[root@sn-2:/mnt/bricks/mstate/brick]
# stat /mnt/bricks/mstate/brick/fstest_76f272545249be5d71359f06962e069b
stat: cannot stat '/mnt/bricks/mstate/brick/fstest_76f272545249be5d71359f06962e069b': No such file or directory

[root@sn-2:/mnt/bricks/mstate/brick]
# getfattr -m . -d -e hex .
# file: .
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.mstate-client-0=0x000000000000000000000006
trusted.afr.mstate-client-2=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0xbf7aad0e4ce44196aa9b0a33928ea2ff

I think the entry fstest_76f272545249be5d71359f06962e069b should either be assigned a gfid or be removed. The glustershd.log shows clearly that glustershd on sn-0 tries to remove this dangling entry but the rmdir fails with "No such file or directory".
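
For readers comparing the trusted.afr.* values in the getfattr output above: each value is 12 bytes, which AFR interprets as three big-endian 32-bit pending counters for data, metadata, and entry operations, in that order. The following is a minimal sketch for decoding them; the helper name decode_afr_pending is hypothetical and not part of any gluster tool.

```python
import struct


def decode_afr_pending(hex_value):
    """Split a trusted.afr.* xattr value into its three pending counters.

    Hypothetical helper: expects the hex string printed by
    `getfattr -e hex`, with or without the leading "0x".
    """
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    # Three unsigned 32-bit integers in network (big-endian) byte order.
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# Values copied from the getfattr output above.
# sn-0, trusted.afr.mstate-client-2
print(decode_afr_pending("0x0000000000000000000002a7"))
# sn-1 and sn-2, trusted.afr.mstate-client-0
print(decode_afr_pending("0x000000000000000000000006"))
```

Read this way, sn-0 records pending entry operations against mstate-client-2, while sn-1 and sn-2 both record pending entry operations against mstate-client-0; the data and metadata counters are zero everywhere, so only an entry heal of "/" is outstanding.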
When I debugged with gdb, I found that in this case the entry is not assigned a gfid: the source input parameter of __afr_selfheal_heal_dirent is 1, and replies[source].op_ret == -1, so no gfid can be assigned to the entry. My question is: since fstest_76f272545249be5d71359f06962e069b does not exist on sn-1 and sn-2, it seems syncop_rmdir cannot be used to remove this dangling entry on sn-0. I would like your opinion on this issue, thanks!

Thread 12 "glustershdheal" hit Breakpoint 1, __afr_selfheal_heal_dirent (frame=0x7f5aec009350, this=0x7f5b1001d8d0, fd=0x7f5b0800c8a0, name=0x7f5b08059db0 "fstest_76f272545249be5d71359f06962e069b", inode=0x7f5aec001e80, source=1, sources=0x7f5af8fefb20 "", healed_sinks=0x7f5af8fefae0 "\001", locked_on=0x7f5af8fefac0 "\001\001\001\370Z\177", replies=0x7f5af8fef190) at afr-self-heal-entry.c:172
172 afr-self-heal-entry.c: No such file or directory.
(gdb) print name
$17 = 0x7f5b08059db0 "fstest_76f272545249be5d71359f06962e069b"
(gdb) print source
$18 = 1
(gdb) print replies[0].op_ret
$19 = 0
(gdb) print replies[1].op_ret
$20 = -1
(gdb) print replies[2].op_ret
$21 = -1
(gdb) print replies[1].op_errno
$22 = 2

When I set a breakpoint on afr_selfheal_entry_delete:

Thread 12 "glustershdheal" hit Breakpoint 1, afr_selfheal_entry_delete (this=0x7f5b1001d8d0, dir=0x7f5b100847f0, name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b", inode=0x7f5aec001650, child=0, replies=0x7f5af8fef190) at afr-self-heal-entry.c:24
24 afr-self-heal-entry.c: No such file or directory.
(gdb) print uuid_utoa(inode->gfid)
$1 = 0x7f5aec0022a0 "00000000-0000-0000-0000-", '0' <repeats 12 times>
(gdb) print name
$2 = 0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b"
(gdb) bt
#0 afr_selfheal_entry_delete (this=0x7f5b1001d8d0, dir=0x7f5b100847f0, name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b", inode=0x7f5aec001650, child=0, replies=0x7f5af8fef190) at afr-self-heal-entry.c:24
#1 0x00007f5b141e517c in __afr_selfheal_heal_dirent (frame=0x7f5aec21f470, this=0x7f5b1001d8d0, fd=0x7f5b10004510, name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b", inode=0x7f5aec001650, source=1, sources=0x7f5af8fefb20 "", healed_sinks=0x7f5af8fefae0 "\001", locked_on=0x7f5af8fefac0 "\001\001\001\370Z\177", replies=0x7f5af8fef190) at afr-self-heal-entry.c:201
#2 0x00007f5b141e59ab in __afr_selfheal_entry_dirent (frame=0x7f5aec21f470, this=0x7f5b1001d8d0, fd=0x7f5b10004510, name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b", inode=0x7f5aec001650, source=1, sources=0x7f5af8fefb20 "", healed_sinks=0x7f5af8fefae0 "\001", locked_on=0x7f5af8fefac0 "\001\001\001\370Z\177", replies=0x7f5af8fef190) at afr-self-heal-entry.c:383
#3 0x00007f5b141e63ec in afr_selfheal_entry_dirent (frame=0x7f5aec21f470, this=0x7f5b1001d8d0, fd=0x7f5b10004510, name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b", parent_idx_inode=0x0, subvol=0x7f5b10016f70, full_crawl=_gf_true) at afr-self-heal-entry.c:610
#4 0x00007f5b141e6a1a in afr_selfheal_entry_do_subvol (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, fd=0x7f5b10004510, child=0) at afr-self-heal-entry.c:742
#5 0x00007f5b141e7207 in afr_selfheal_entry_do (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, fd=0x7f5b10004510, source=1, sources=0x7f5af8ff07f0 "", healed_sinks=0x7f5af8ff07b0 "\001") at afr-self-heal-entry.c:908
#6 0x00007f5b141e7846 in __afr_selfheal_entry (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, fd=0x7f5b10004510, locked_on=0x7f5af8ff0900 "\001\001\001[") at afr-self-heal-entry.c:1002
#7 0x00007f5b141e7d4a in afr_selfheal_entry (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, inode=0x7f5b100847f0) at afr-self-heal-entry.c:1112
#8 0x00007f5b141df3aa in afr_selfheal_do (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, gfid=0x7f5af8ff0b00 "") at afr-self-heal-common.c:2534
#9 0x00007f5b141df4a0 in afr_selfheal (this=0x7f5b1001d8d0, gfid=0x7f5af8ff0b00 "") at afr-self-heal-common.c:2575
#10 0x00007f5b141eadec in afr_shd_selfheal (healer=0x7f5b10084c30, child=0, gfid=0x7f5af8ff0b00 "") at afr-self-heald.c:343
#11 0x00007f5b141eb19b in afr_shd_index_heal (subvol=0x7f5b10016f70, entry=0x7f5b100012f0, parent=0x7f5af8ff0dc0, data=0x7f5b10084c30) at afr-self-heald.c:440
#12 0x00007f5b1a682ed3 in syncop_mt_dir_scan (frame=0x7f5b100b89e0, subvol=0x7f5b10016f70, loc=0x7f5af8ff0dc0, pid=-6, data=0x7f5b10084c30, fn=0x7f5b141eb04c <afr_shd_index_heal>, xdata=0x7f5b100b88d0, max_jobs=1, max_qlen=1024) at syncop-utils.c:407
#13 0x00007f5b141eb445 in afr_shd_index_sweep (healer=0x7f5b10084c30, vgfid=0x7f5b14213790 "glusterfs.xattrop_index_gfid") at afr-self-heald.c:494
#14 0x00007f5b141eb524 in afr_shd_index_sweep_all (healer=0x7f5b10084c30) at afr-self-heald.c:517
#15 0x00007f5b141eb827 in afr_shd_index_healer (data=0x7f5b10084c30) at afr-self-heald.c:597
#16 0x00007f5b193cd5da in start_thread () from /lib64/libpthread.so.0
#17 0x00007f5b18ca3cbf in clone () from /lib64/libc.so.6
(gdb) quit
A debugging session is active.
Inferior 1 [process 2230] will be detached.
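
To make the state seen in the gdb session easier to follow, here is a deliberately simplified model of the per-entry decision, written as a sketch and not taken from the glusterfs source. It assumes the heal only compares each sink's lookup reply with the source's reply; the function name heal_dirent_actions and the action labels are hypothetical. The inputs mirror the printed values: source=1, healed_sinks marking only child 0, replies[0].op_ret == 0, and replies[1].op_ret == replies[2].op_ret == -1.

```python
import errno


def heal_dirent_actions(source, replies, healed_sinks):
    """Return (child, action) pairs for one directory entry.

    Simplified, hypothetical model: `replies` holds one
    (op_ret, op_errno) pair per brick from looking up the entry name,
    and `healed_sinks` marks the bricks that are being healed.
    """
    actions = []
    source_ret, _ = replies[source]
    for child, is_sink in enumerate(healed_sinks):
        if not is_sink or child == source:
            continue
        sink_ret, _ = replies[child]
        if source_ret == 0:
            # The source has the entry: a sink missing it needs it recreated.
            if sink_ret != 0:
                actions.append((child, "recreate"))
        elif sink_ret == 0:
            # The source lacks the entry but the sink has it: expunge it.
            actions.append((child, "delete"))
    return actions


# Values from the gdb session; op_errno was printed only for replies[1],
# so the other errno fields are left as None rather than guessed.
replies = [(0, None), (-1, errno.ENOENT), (-1, None)]
healed_sinks = [True, False, False]  # healed_sinks = "\001"
print(heal_dirent_actions(1, replies, healed_sinks))
```

Under these assumptions the only action is a delete on child 0, which matches the "expunging dir ... on mstate-client-0" warning in glustershd.log. The model does not explain why the subsequent rmdir fails; it only shows why no gfid assignment is attempted when the source brick does not have the entry.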