Walter Deignan
2017-May-17 19:21 UTC
[Gluster-users] Deleting large files on sharded volume hangs and doesn't delete shards
I have a reproducible issue where attempting to delete a file large enough to have been sharded hangs. I can't kill the 'rm' command and eventually am forced to reboot the client (which in this case is also part of the gluster cluster). After the node finishes rebooting I can see that while the file front-end is gone, the back-end shards are still present.

Is this a known issue? Any way to get around it?

----------------------------------------------

[root at dc-vihi19 ~]# gluster volume info gv0

Volume Name: gv0
Type: Tier
Volume ID: d42e366f-381d-4787-bcc5-cb6770cb7d58
Status: Started
Snapshot Count: 0
Number of Bricks: 24
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: dc-vihi71:/gluster/bricks/brick4/data
Brick2: dc-vihi19:/gluster/bricks/brick4/data
Brick3: dc-vihi70:/gluster/bricks/brick4/data
Brick4: dc-vihi19:/gluster/bricks/brick3/data
Brick5: dc-vihi71:/gluster/bricks/brick3/data
Brick6: dc-vihi19:/gluster/bricks/brick2/data
Brick7: dc-vihi70:/gluster/bricks/brick3/data
Brick8: dc-vihi19:/gluster/bricks/brick1/data
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 8 x 2 = 16
Brick9: dc-vihi19:/gluster/bricks/brick5/data
Brick10: dc-vihi70:/gluster/bricks/brick1/data
Brick11: dc-vihi19:/gluster/bricks/brick6/data
Brick12: dc-vihi71:/gluster/bricks/brick1/data
Brick13: dc-vihi19:/gluster/bricks/brick7/data
Brick14: dc-vihi70:/gluster/bricks/brick2/data
Brick15: dc-vihi19:/gluster/bricks/brick8/data
Brick16: dc-vihi71:/gluster/bricks/brick2/data
Brick17: dc-vihi19:/gluster/bricks/brick9/data
Brick18: dc-vihi70:/gluster/bricks/brick5/data
Brick19: dc-vihi19:/gluster/bricks/brick10/data
Brick20: dc-vihi71:/gluster/bricks/brick5/data
Brick21: dc-vihi19:/gluster/bricks/brick11/data
Brick22: dc-vihi70:/gluster/bricks/brick6/data
Brick23: dc-vihi19:/gluster/bricks/brick12/data
Brick24: dc-vihi71:/gluster/bricks/brick6/data
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
features.ctr-enabled: on
cluster.tier-mode: cache
features.shard: on
features.shard-block-size: 512MB
network.ping-timeout: 5
cluster.server-quorum-ratio: 51%

[root at dc-vihi19 temp]# ls -lh
total 26G
-rw-rw-rw-. 1 root root 31G May 17 10:38 win7.qcow2
[root at dc-vihi19 temp]# getfattr -n glusterfs.gfid.string win7.qcow2
# file: win7.qcow2
glusterfs.gfid.string="7f4a0fea-72c0-41e4-97a5-6297be0a9142"

[root at dc-vihi19 temp]# rm win7.qcow2
rm: remove regular file 'win7.qcow2'? y

*Process hangs and can't be killed. A reboot later...*

login as: root
Authenticating with public key "rsa-key-20170510"
Last login: Wed May 17 14:04:29 2017 from ******
[root at dc-vihi19 ~]# find /gluster/bricks -name "7f4a0fea-72c0-41e4-97a5-6297be0a9142*"
/gluster/bricks/brick1/data/.shard/7f4a0fea-72c0-41e4-97a5-6297be0a9142.23
/gluster/bricks/brick1/data/.shard/7f4a0fea-72c0-41e4-97a5-6297be0a9142.35
/gluster/bricks/brick2/data/.shard/7f4a0fea-72c0-41e4-97a5-6297be0a9142.52
/gluster/bricks/brick2/data/.shard/7f4a0fea-72c0-41e4-97a5-6297be0a9142.29
/gluster/bricks/brick2/data/.shard/7f4a0fea-72c0-41e4-97a5-6297be0a9142.22
/gluster/bricks/brick2/data/.shard/7f4a0fea-72c0-41e4-97a5-6297be0a9142.24

and so on...

-Walter Deignan
-Uline IT, Systems Architect
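A minimal shell sketch of how the leftover shards for that GFID could be listed and tallied per brick, run on each node. It is not from the original mail: the GFID and the /gluster/bricks/<brick>/data layout are taken from the output above, while the variable names and the awk field index are assumptions specific to that layout.

GFID="7f4a0fea-72c0-41e4-97a5-6297be0a9142"
BRICK_ROOT="/gluster/bricks"

# Shards live under <brick>/.shard and are named <gfid>.<index>.
# List every leftover shard for this GFID with its size.
find "$BRICK_ROOT" -path '*/.shard/*' -name "${GFID}.*" -printf '%s bytes  %p\n'

# Tally orphaned shards per brick (field 4 of the path is the brick
# directory in the /gluster/bricks/<brick>/data layout shown above).
find "$BRICK_ROOT" -path '*/.shard/*' -name "${GFID}.*" | awk -F/ '{print $4}' | sort | uniq -c

This shows how much space the orphaned shards still occupy on each brick before deciding how to clean them up.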
Nithya Balachandran
2017-May-18 03:16 UTC
[Gluster-users] Deleting large files on sharded volume hangs and doesn't delete shards
I don't think we have tested shards with a tiered volume. Do you see such issues on non-tiered sharded volumes?

Regards,
Nithya

On 18 May 2017 at 00:51, Walter Deignan <WDeignan at uline.com> wrote:

> I have a reproducible issue where attempting to delete a file large enough
> to have been sharded hangs. I can't kill the 'rm' command and eventually am
> forced to reboot the client (which in this case is also part of the gluster
> cluster). After the node finishes rebooting I can see that while the file
> front-end is gone, the back-end shards are still present.
>
> Is this a known issue? Any way to get around it?
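One way to answer that question is to reproduce on a plain distributed-replicate volume with the same shard settings. A hedged sketch follows; the volume name (testshard), the test brick paths, and the copied file are made up for illustration and do not exist in the setup above.

gluster volume create testshard replica 2 \
    dc-vihi70:/gluster/bricks/test1/data dc-vihi71:/gluster/bricks/test1/data \
    dc-vihi70:/gluster/bricks/test2/data dc-vihi71:/gluster/bricks/test2/data
gluster volume set testshard features.shard on
gluster volume set testshard features.shard-block-size 512MB
gluster volume start testshard

# Mount it, copy in a file larger than 512MB so it gets sharded, then repeat the rm.
mount -t glusterfs dc-vihi70:/testshard /mnt/testshard
cp /path/to/large-image.qcow2 /mnt/testshard/ && rm /mnt/testshard/large-image.qcow2

If the rm completes and the .shard entries disappear from the test bricks, the hang is specific to the tiered gv0 configuration.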
Pranith Kumar Karampuri
2017-May-18 04:03 UTC
[Gluster-users] Deleting large files on sharded volume hangs and doesn't delete shards
Seems like a frame loss. Could you collect a statedump of the mount process? You may have to use the kill -USR1 method described in the doc below when the process hangs. Please also get statedumps of the brick processes.

https://gluster.readthedocs.io/en/latest/Troubleshooting/statedump/

On Thu, May 18, 2017 at 12:51 AM, Walter Deignan <WDeignan at uline.com> wrote:

> I have a reproducible issue where attempting to delete a file large enough
> to have been sharded hangs. I can't kill the 'rm' command and eventually am
> forced to reboot the client (which in this case is also part of the gluster
> cluster). After the node finishes rebooting I can see that while the file
> front-end is gone, the back-end shards are still present.
>
> Is this a known issue? Any way to get around it?

--
Pranith
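A sketch of how that collection might look while the rm is hung, assuming the default statedump directory (/var/run/gluster/) described in the linked doc; PIDs and paths will differ on each node.

# 1. Client side: find the FUSE mount process for gv0 and send it SIGUSR1.
ps aux | grep '[g]lusterfs.*gv0'     # note the PID of the glusterfs client process
kill -USR1 <PID>                     # triggers a statedump of the hung mount

# 2. Brick side: dump all brick processes of the volume from any server node.
gluster volume statedump gv0

# 3. Gather the resulting dump files from every node.
ls -l /var/run/gluster/*dump*

Here <PID> is a placeholder for the process id found in step 1; if server.statedump-path has been changed on the volume, look there instead of /var/run/gluster/.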