Li, Deqian (NSB - CN/Hangzhou)
2019-Jan-30 02:05 UTC
[Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
Hi,

Could you help to check this coredump?

We are using glusterfs 3.12-3 (a 3-replica brick solution) for stability testing under high CPU load (around 80%, generated with stress) while doing I/O. After several hours, a coredump happened on the glusterfs side.

[Current thread is 1 (Thread 0x7ffff37d2700 (LWP 3696))]
Missing separate debuginfos, use: dnf debuginfo-install rcp-pack-glusterfs-1.8.1_11_g99e9ca6-RCP2.wf28.x86_64
(gdb) bt
#0  0x00007ffff0d5c845 in wb_fulfill (wb_inode=0x7fffd406b3b0, liabilities=0x7fffdc234b50) at write-behind.c:1148
#1  0x00007ffff0d5e4d5 in wb_process_queue (wb_inode=0x7fffd406b3b0) at write-behind.c:1718
#2  0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0)
    at write-behind.c:1825
#3  0x00007ffff0b51fcb in du_writev_resume (ret=0, frame=0x7fffdc0305a0, opaque=0x7fffdc0305a0) at disk-usage.c:490
#4  0x00007ffff7b3510d in synctask_wrap () at syncop.c:377
#5  0x00007ffff60d0660 in ?? () from /lib64/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) p wb_inode
$1 = (wb_inode_t *) 0x7fffd406b3b0
(gdb) frame 2
#2  0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0)
    at write-behind.c:1825
1825    in write-behind.c
(gdb) p *fd
$2 = {pid = 18154, flags = 32962, refcount = 0, inode_list = {next = 0x7fffe4034080, prev = 0x7fffe4034080}, inode = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}},
  _ctx = 0x7fffe4022930, xl_count = 17, lk_ctx = 0x7fffe40350e0, anonymous = _gf_false}
(gdb) p fd
$3 = (fd_t *) 0x7fffe4034070
(gdb) p wb_inode->this
$1 = (xlator_t *) 0xffffffffffffff00

After adding test logs I found that the FOP sequence on the write-behind xlator side was out of order, as shown below. On the FUSE side the FLUSH comes after write2, but on the WB side the FLUSH lands between write2's 'wb_do_unwinds' and 'wb_fulfill'. So I think this indicates a problem. I think it's possible that the FLUSH and the later RELEASE operation destroy the fd, which would cause wb_inode->this to be 0xffffffffffffff00. Do you think so?

And I think our newly added disk-usage xlator's synctask_new delays the write operation, while the FLUSH operation has no such delay (because it does not go through the disk-usage xlator).

Do you agree with my speculation? And how should we fix it? (We don't want to move the disk-usage xlator.)

Problematic FOP sequence:

FUSE side:                 WB side:

Write 1                    write1
Write 2                    do unwind (write2)
                           FLUSH
                           Release (destroy fd)
FLUSH                      write2 (wb_fulfill)  --> then coredump
Release

int
wb_fulfill (wb_inode_t *wb_inode, list_head_t *liabilities)
{
        wb_request_t *req = NULL;
        wb_request_t *head = NULL;
        wb_request_t *tmp = NULL;
        wb_conf_t    *conf = NULL;
        off_t         expected_offset = 0;
        size_t        curr_aggregate = 0;
        size_t        vector_count = 0;
        int           ret = 0;

        conf = wb_inode->this->private;     --> this line coredumps

        list_for_each_entry_safe (req, tmp, liabilities, winds) {
                list_del_init (&req->winds);
        ....

volume ccs-write-behind
 68:     type performance/write-behind
 69:     subvolumes ccs-dht
 70: end-volume
 71:
 72: volume ccs-disk-usage       --> we added a new xlator here for the write op,
                                     just to check whether the disk is full;
                                     it uses synctask_new for writes.
 73:     type performance/disk-usage
 74:     subvolumes ccs-write-behind
 75: end-volume
 76:
 77: volume ccs-read-ahead
 78:     type performance/read-ahead
 79:     subvolumes ccs-disk-usage
 80: end-volume

PS: part of our new translator code:

int
du_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
           struct iovec *vector, int count, off_t off, uint32_t flags,
           struct iobref *iobref, dict_t *xdata)
{
        int          op_errno = -1;
        int          ret      = -1;
        du_local_t  *local    = NULL;
        loc_t        tmp_loc  = {0,};

        VALIDATE_OR_GOTO (frame, err);
        VALIDATE_OR_GOTO (this, err);
        VALIDATE_OR_GOTO (fd, err);

        tmp_loc.gfid[15] = 1;
        tmp_loc.inode = fd->inode;
        tmp_loc.parent = fd->inode;

        local = du_local_init (frame, &tmp_loc, fd, GF_FOP_WRITE);
        if (!local) {
                op_errno = ENOMEM;
                goto err;
        }

        local->vector = iov_dup (vector, count);
        local->offset = off;
        local->count  = count;
        local->flags  = flags;
        local->iobref = iobref_ref (iobref);

        ret = synctask_new (this->ctx->env, du_get_du_info,
                            du_writev_resume, frame, frame);
        if (ret) {
                op_errno = -1;
                gf_log (this->name, GF_LOG_WARNING,
                        "synctask_new return failure ret(%d)", ret);
                goto err;
        }

        return 0;

err:
        op_errno = (op_errno == -1) ? errno : op_errno;
        DU_STACK_UNWIND (writev, frame, -1, op_errno, NULL, NULL, NULL);
        return 0;
}

Br,
Li Deqian
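P.S. Would it make sense, if the fd really is being released while du_get_du_info runs in the synctask, for the disk-usage xlator to pin the fd itself for the lifetime of the task? Below is a rough, untested sketch of that idea; it assumes du_local_t has (or gains) an fd member, and the helper names are made up for illustration.

/* Untested sketch: pin the fd across the synctask delay so a racing
 * RELEASE cannot destroy it before du_writev_resume() winds the write.
 * Assumes du_local_t carries an 'fd' member; helper names are
 * illustrative only, not existing glusterfs APIs beyond fd_ref/fd_unref. */

static void
du_local_pin_fd (du_local_t *local, fd_t *fd)
{
        /* called from du_writev(), before synctask_new() */
        local->fd = fd_ref (fd);
}

static void
du_local_unpin_fd (du_local_t *local)
{
        /* called once the resumed write has been wound and answered */
        if (local->fd) {
                fd_unref (local->fd);
                local->fd = NULL;
        }
}

Of course this only papers over the race if our speculation is right, so we would still like to understand the real root cause.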
Raghavendra Gowdappa
2019-Jan-30 02:59 UTC
[Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) <deqian.li at nokia-sbell.com> wrote:

>         ret = synctask_new (this->ctx->env, du_get_du_info,
>                             du_writev_resume, frame, frame);

Can you paste the code of:

* du_get_du_info
* du_writev_resume
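Just so we are talking about the same thing: what I'd expect a resume callback launched through synctask_new to do is simply wind the saved write down to the child xlator, roughly along the lines of the sketch below. This is only my guess at its shape, not your code (du_writev_cbk is a made-up name, and I'm assuming du_local_init() stashed the fd in local->fd), which is exactly why I'd like to see the real du_get_du_info and du_writev_resume.

/* Guess at the general shape of du_writev_resume() -- NOT the actual
 * disk-usage code.  du_writev_cbk is a hypothetical callback; the fd is
 * assumed to be stored in local->fd by du_local_init(). */
int
du_writev_resume (int ret, call_frame_t *frame, void *opaque)
{
        du_local_t *local = frame->local;
        xlator_t   *this  = frame->this;

        /* du_get_du_info ran in the synctask; now hand the write down
         * to the next xlator (write-behind in this graph). */
        STACK_WIND (frame, du_writev_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->writev,
                    local->fd, local->vector, local->count,
                    local->offset, local->flags, local->iobref,
                    NULL /* xdata is not preserved in the posted du_writev */);

        return 0;
}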
Raghavendra Gowdappa
2019-Jan-30 04:14 UTC
[Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) <deqian.li at nokia-sbell.com> wrote:

> After adding test logs I found that the FOP sequence on the write-behind
> xlator side was out of order, as shown below. On the FUSE side the FLUSH
> comes after write2, but on the WB side the FLUSH lands between write2's
> 'wb_do_unwinds' and 'wb_fulfill'. So I think this indicates a problem.

wb_do_unwinds unwinds the write after caching it. Once the write response is unwound, the kernel issues a FLUSH. So, this is a valid sequence of operations and nothing is wrong there.

> I think it's possible that the FLUSH and the later RELEASE operation
> destroy the fd, which would cause wb_inode->this to be
> 0xffffffffffffff00. Do you think so?
>
> And I think our newly added disk-usage xlator's synctask_new delays the
> write operation, while the FLUSH operation has no such delay (because it
> does not go through the disk-usage xlator).

Flush and release wait for completion of any on-going writes. So, unless write-behind has unwound the write, it won't see a flush. By the time the write is unwound, write-behind makes sure to take references on the objects it uses (like fds, iobrefs etc.). So, I don't see a problem there.

> Do you agree with my speculation? And how should we fix it? (We don't
> want to move the disk-usage xlator.)

I've still not found the RCA. We can discuss the fix once the RCA is found.
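To make the "takes references" point concrete: when write-behind queues a write, it pins the objects it will need later, conceptually like the sketch below. This is a simplified illustration, not the actual write-behind code (the struct and function names here are made up); the real logic lives in write-behind.c's request-handling path.

/* Simplified illustration -- not the real write-behind code.  The point
 * is that the queued request holds its own references, so a FLUSH or
 * RELEASE racing with the background write cannot free the fd or the
 * data buffers out from under wb_fulfill(). */

typedef struct {
        struct list_head  winds;   /* hangs off the inode's liability list */
        fd_t             *fd;      /* pinned with fd_ref()                 */
        struct iobref    *iobref;  /* pinned with iobref_ref()             */
} wb_request_sketch_t;

static wb_request_sketch_t *
wb_request_sketch_new (fd_t *fd, struct iobref *iobref)
{
        wb_request_sketch_t *req = NULL;

        req = GF_CALLOC (1, sizeof (*req), gf_common_mt_char);
        if (!req)
                return NULL;

        INIT_LIST_HEAD (&req->winds);
        req->fd     = fd_ref (fd);          /* survives a racing RELEASE */
        req->iobref = iobref_ref (iobref);  /* keeps the payload alive   */

        return req;
}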