Hi all,

I keep hitting a lockup when reading or writing large numbers of files. I cannot break out of it in gdb, ps locks up when it tries to read that process's data, and df (of course) locks up too. Kill signals have no effect. Apart from 'pidstat -p ALL', which can still report the pid, I couldn't do anything with the process. The only way out is umount -f.

I am using gluster 3.2.6 on CentOS 6.0 (2.6.32-71.el6.x86_64).

The problem looks the same as BUG 764964 (https://bugzilla.redhat.com/show_bug.cgi?id=764964). It is difficult to reproduce, and I am still looking for a way to trigger it quickly. Has anyone else encountered this problem? How did you solve it?

dmesg log:

May 10 00:01:52 PPC-002 kernel: INFO: task glusterfs:27888 blocked for more than 120 seconds.
May 10 00:01:52 PPC-002 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 10 00:01:52 PPC-002 kernel: glusterfs D ffff88033fc24b00 0 27888 1 0x00000080
May 10 00:01:52 PPC-002 kernel: ffff8806310fbe48 0000000000000086 0000000000000000 ffff8806310fbc58
May 10 00:01:52 PPC-002 kernel: ffff8806310fbdc8 0000000000020010 ffff8806310fbee8 00000001021f184a
May 10 00:01:52 PPC-002 kernel: ffff8806311a0678 ffff8806310fbfd8 0000000000010518 ffff8806311a0678
May 10 00:01:52 PPC-002 kernel: Call Trace:
May 10 00:01:52 PPC-002 kernel: [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
May 10 00:01:52 PPC-002 kernel: [<ffffffff814ca813>] rwsem_down_write_failed+0x23/0x30
May 10 00:01:52 PPC-002 kernel: [<ffffffff81264253>] call_rwsem_down_write_failed+0x13/0x20
May 10 00:01:52 PPC-002 kernel: [<ffffffff814c9d12>] ? down_write+0x32/0x40
May 10 00:01:52 PPC-002 kernel: [<ffffffff8113b468>] sys_munmap+0x48/0x80
May 10 00:01:52 PPC-002 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b

Thank you in advance.
Yaodong

2012-05-10
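For anyone trying to confirm the same symptom: the trace shows the glusterfs task sitting in state D (uninterruptible sleep) inside down_write() called from sys_munmap. Below is a minimal sketch for listing such tasks without using ps. It assumes, as the fact that pidstat still works suggests, that /proc/<pid>/stat stays readable for the hung process; that is an assumption, not something verified on this system. The function names state_from_stat and list_blocked are made up for this illustration.

```shell
#!/bin/sh
# state_from_stat: print the state letter from one /proc/<pid>/stat line.
# The command name is wrapped in parentheses and may itself contain
# spaces or ')', so take the first field after the LAST ") ".
state_from_stat() {
    rest=${1##*") "}
    printf '%s\n' "${rest%% *}"
}

# list_blocked: print "pid comm" for every task currently in state D.
# Only /proc/<pid>/stat is read (assumption: this file does not block
# on the hung task the way ps does).
list_blocked() {
    for f in /proc/[0-9]*/stat; do
        # A task may exit between the glob and the read; skip it.
        line=$(cat "$f" 2>/dev/null) || continue
        [ "$(state_from_stat "$line")" = "D" ] || continue
        pid=${line%% *}
        comm=${line#*"("}
        comm=${comm%")"*}
        printf '%s %s\n' "$pid" "$comm"
    done
}
```

Calling list_blocked a few times, some seconds apart, would show whether the same pid stays in state D. A task that appears only once is probably just waiting on ordinary I/O.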
Can you explain where glusterfs is being used? Is this lockup happening on a VM running on a file-backed disk image on top of gluster? Is gluster itself causing this timeout?

On Wed, May 9, 2012 at 6:59 PM, chyd <chyd at ihep.ac.cn> wrote:
> I'm encountering a lockup problem many times when reading/writing large
> numbers of files.
> [...]
> I am using gluster 3.2.6 on CentOS 6.0 (2.6.32-71.el6.x86_64).
Hi,

I am using gluster on a physical machine (CPU: 2x Xeon E5620, memory: 24 GB, 1 Gbps network link, CentOS 6.0, Linux 2.6.32-71.el6.x86_64). When reading or writing a small number of files, the system is fine. But when too many files are accessed concurrently, the problem sometimes occurs.

After disabling THP (transparent huge pages) entirely:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

the problem seems to be resolved. I will continue testing and see how it goes.

Thanks,
Yaodong

> -----Original Message-----
> From: "Bryan Whitehead" <driver at megahappy.net>
> Sent: May 11, 2012
> To: chyd <chyd at ihep.ac.cn>
> Cc: gluster-users <gluster-users at gluster.org>
> Subject: Re: [Gluster-users] BUG: 764964 (dead lock)
>
> Can you explain where glusterfs is being used? Is this lockup
> happening on a VM running in on a file-disk-image on top of gluster?
> is gluster itself causing this timeout?
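Before and after applying the THP workaround above, it can help to check which mode is actually active. The kernel prints all choices in that file and puts square brackets around the current one. The sketch below pulls out the bracketed value. The function names are invented for this example, and the second path is the upstream kernel location, added here as an assumption for systems that are not RHEL/CentOS 6; it is not from this thread.

```shell
#!/bin/sh
# thp_active_mode: extract the bracketed (active) choice from the text
# the kernel prints for the THP "enabled" file, e.g. "always [never]".
thp_active_mode() {
    mode=${1#*"["}
    printf '%s\n' "${mode%%"]"*}"
}

# thp_report: print the active mode for each THP control file that exists.
# The redhat_ path is the one used in this thread; the plain path is the
# upstream location (assumption for other distributions).
thp_report() {
    for f in /sys/kernel/mm/redhat_transparent_hugepage/enabled \
             /sys/kernel/mm/transparent_hugepage/enabled; do
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$f" "$(thp_active_mode "$(cat "$f")")"
    done
}
```

Running thp_report after the echo should show "never" for the path that exists. Note that a value written through /sys does not survive a reboot, so the check is worth repeating after the machine restarts.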