Hi,

I have a five-node GlusterFS cluster using Distributed-Replicate, with 180
bricks in total. The OS is CentOS 6.5 and GlusterFS is 3.11.0. I find that
many bricks go offline when we generate some empty files and rename them,
and I see an XFS call trace on every node.
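The workload that triggers it is roughly the following (a sketch, not our exact test; MOUNT is a hypothetical stand-in for the volume's FUSE mount point):

```shell
#!/bin/sh
# Sketch of the triggering workload: create empty files, then rename them.
# MOUNT is a hypothetical default; point it at the real volume mount to
# reproduce (the renames end up in xfs_rename on the brick filesystems).
MOUNT="${MOUNT:-/tmp/gluster-repro}"
mkdir -p "$MOUNT"

i=0
while [ "$i" -lt 100 ]; do
    : > "$MOUNT/empty.$i"                      # generate an empty file
    mv "$MOUNT/empty.$i" "$MOUNT/renamed.$i"   # then rename it
    i=$((i + 1))
done
echo "created and renamed $i files"
```

Running this against the mounted volume is enough to knock bricks offline in our setup; against a local directory it is of course harmless.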
For example,
Nov 16 11:15:12 node10 kernel: XFS (rdc00d28p2): Internal error
xfs_trans_cancel at line 1948 of file fs/xfs/xfs_trans.c. Caller
0xffffffffa04e33f9
Nov 16 11:15:12 node10 kernel:
Nov 16 11:15:12 node10 kernel: Pid: 9939, comm: glusterfsd Tainted: G
--------------- H 2.6.32-prsys.1.1.0.13.x86_64 #1
Nov 16 11:15:12 node10 kernel: Call Trace:
Nov 16 11:15:12 node10 kernel: [<ffffffffa04c803f>] ?
xfs_error_report+0x3f/0x50 [xfs]
Nov 16 11:15:12 node10 kernel: [<ffffffffa04e33f9>] ?
xfs_rename+0x2c9/0x6c0 [xfs]
Nov 16 11:15:12 node10 kernel: [<ffffffffa04e5e39>] ?
xfs_trans_cancel+0xd9/0x100 [xfs]
Nov 16 11:15:12 node10 kernel: [<ffffffffa04e33f9>] ?
xfs_rename+0x2c9/0x6c0 [xfs]
Nov 16 11:15:12 node10 kernel: [<ffffffff811962c5>] ?
mntput_no_expire+0x25/0xb0
Nov 16 11:15:12 node10 kernel: [<ffffffffa04f5a06>] ?
xfs_vn_rename+0x66/0x70 [xfs]
Nov 16 11:15:12 node10 kernel: [<ffffffff81184580>] ?
vfs_rename+0x2a0/0x500
Nov 16 11:15:12 node10 kernel: [<ffffffff81182cd6>] ?
generic_permission+0x16/0xa0
Nov 16 11:15:12 node10 kernel: [<ffffffff811882d9>] ?
sys_renameat+0x369/0x420
Nov 16 11:15:12 node10 kernel: [<ffffffff81185f06>] ?
final_putname+0x26/0x50
Nov 16 11:15:12 node10 kernel: [<ffffffff81186189>] ? putname+0x29/0x40
Nov 16 11:15:12 node10 kernel: [<ffffffff811861f9>] ?
user_path_at+0x59/0xa0
Nov 16 11:15:12 node10 kernel: [<ffffffff8151dc79>] ?
unroll_tree_refs+0x16/0xbc
Nov 16 11:15:12 node10 kernel: [<ffffffff810d1698>] ?
audit_syscall_entry+0x2d8/0x300
Nov 16 11:15:12 node10 kernel: [<ffffffff811883ab>] ? sys_rename+0x1b/0x20
Nov 16 11:15:12 node10 kernel: [<ffffffff8100b032>] ?
system_call_fastpath+0x16/0x1b
Nov 16 11:15:12 node10 kernel: XFS (rdc00d28p2): xfs_do_force_shutdown(0x8)
called from line 1949 of file fs/xfs/xfs_trans.c. Return address
0xffffffffa04e5e52
Nov 16 11:15:12 node10 kernel: XFS (rdc00d28p2): Corruption of in-memory
data detected. Shutting down filesystem
Nov 16 11:15:12 node10 kernel: XFS (rdc00d28p2): Please umount the
filesystem and rectify the problem(s)
Nov 16 11:15:30 node10 disks-FAvUzxiL-brick[29742]: [2017-11-16
11:15:30.206208] M [MSGID: 113075]
[posix-helpers.c:1891:posix_health_check_thread_proc] 0-data-posix:
health-check failed, going down
Nov 16 11:15:30 node10 disks-FAvUzxiL-brick[29742]: [2017-11-16
11:15:30.206538] M [MSGID: 113075]
[posix-helpers.c:1908:posix_health_check_thread_proc] 0-data-posix: still
alive! -> SIGTERM
Nov 16 11:15:37 node10 kernel: XFS (sdm): xfs_log_force: error 5 returned.
Nov 16 11:16:07 node10 kernel: XFS (sdm): xfs_log_force: error 5 returned.
I don't think it's related to the hard disks, because the problem is
reproducible and occurs on different bricks. All the disks are new, and I
don't see any low-level I/O errors. Is this a bug in XFS or in GlusterFS?
Is there a workaround?
Thanks,
Paul
On Thu, Nov 16, 2017 at 6:23 AM, Paul <flypen at gmail.com> wrote:

> [original report with the XFS call trace quoted in full; trimmed]

As the logs indicate, xfs shut down and the posix health check feature in
Gluster rendered the brick offline. You would be better off checking with
the xfs community about this problem.

Regards,
Vijay
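For anyone else triaging this, one way to confirm it is the same failure mode is to check the kernel log for the XFS force-shutdown signature before suspecting Gluster itself. A sketch (the log path is an assumption; on CentOS 6 kernel messages usually land in /var/log/messages):

```shell
#!/bin/sh
# check_xfs_shutdown LOGFILE: report whether the log contains the XFS
# force-shutdown messages seen in the trace above. The default path is
# an assumption for CentOS 6; adjust for your distro.
check_xfs_shutdown() {
    if grep -Eq "xfs_do_force_shutdown|xfs_log_force: error" "$1" 2>/dev/null; then
        echo "XFS shutdown found in $1"
    else
        echo "no XFS shutdown in $1"
    fi
}

check_xfs_shutdown "${1:-/var/log/messages}"
```

If the signature is present, the brick going down is just Gluster's posix health check reacting to the dead filesystem underneath it.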
Vijay,

Yes, I found later that it was a problem in XFS. After upgrading the XFS
code, I haven't seen this problem again. Thanks a lot!

Paul

On Fri, Nov 17, 2017 at 12:08 AM, Vijay Bellur <vbellur at redhat.com> wrote:

> [earlier thread quoted in full; trimmed]