Li, Deqian (NSB - CN/Hangzhou)
2019-Jan-30 02:05 UTC
[Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
Hi,

Could you help to check this coredump?

We are using glusterfs 3.12-3 (a 3-replica brick solution) for stability testing under high CPU load (around 80%, generated with stress) while doing I/O. After several hours, a coredump happened on the glusterfs side.

[Current thread is 1 (Thread 0x7ffff37d2700 (LWP 3696))]
Missing separate debuginfos, use: dnf debuginfo-install rcp-pack-glusterfs-1.8.1_11_g99e9ca6-RCP2.wf28.x86_64
(gdb) bt
#0  0x00007ffff0d5c845 in wb_fulfill (wb_inode=0x7fffd406b3b0, liabilities=0x7fffdc234b50) at write-behind.c:1148
#1  0x00007ffff0d5e4d5 in wb_process_queue (wb_inode=0x7fffd406b3b0) at write-behind.c:1718
#2  0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0)
    at write-behind.c:1825
#3  0x00007ffff0b51fcb in du_writev_resume (ret=0, frame=0x7fffdc0305a0, opaque=0x7fffdc0305a0) at disk-usage.c:490
#4  0x00007ffff7b3510d in synctask_wrap () at syncop.c:377
#5  0x00007ffff60d0660 in ?? () from /lib64/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) p wb_inode
$1 = (wb_inode_t *) 0x7fffd406b3b0
(gdb) frame 2
#2  0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290, this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1, offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0)
    at write-behind.c:1825
1825    in write-behind.c
(gdb) p *fd
$2 = {pid = 18154, flags = 32962, refcount = 0, inode_list = {next = 0x7fffe4034080, prev = 0x7fffe4034080}, inode = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}},
  _ctx = 0x7fffe4022930, xl_count = 17, lk_ctx = 0x7fffe40350e0, anonymous = _gf_false}
(gdb) p fd
$3 = (fd_t *) 0x7fffe4034070
(gdb) p wb_inode->this
$1 = (xlator_t *) 0xffffffffffffff00

After adding test logs I found that the FOP sequence on the write-behind xlator side was out of order, as shown below. On the FUSE side the FLUSH comes after write2, but on the WB side the FLUSH lands between write2's 'wb_do_unwinds' and 'wb_fulfill'. So I think this indicates a problem. I think it's possible that the FLUSH and the later RELEASE operation destroy the fd, which would cause wb_inode->this to be 0xffffffffffffff00. Do you think so?

And I think our newly added disk-usage xlator's synctask_new delays the write operation, while the FLUSH operation has no such delay (because it does not go through the disk-usage xlator).

Do you agree with my speculation? And how should we fix it? (We don't want to move the disk-usage xlator.)

Problematic FOP sequence:

FUSE side:                 WB side:

Write 1                    write1
Write 2                    do unwind (write2)
                           FLUSH
                           Release (destroy fd)
FLUSH                      write2 (wb_fulfill)  --> then coredump
Release

int
wb_fulfill (wb_inode_t *wb_inode, list_head_t *liabilities)
{
        wb_request_t *req = NULL;
        wb_request_t *head = NULL;
        wb_request_t *tmp = NULL;
        wb_conf_t    *conf = NULL;
        off_t         expected_offset = 0;
        size_t        curr_aggregate = 0;
        size_t        vector_count = 0;
        int           ret = 0;

        conf = wb_inode->this->private;     --> this line coredumps

        list_for_each_entry_safe (req, tmp, liabilities, winds) {
                list_del_init (&req->winds);
        ....

volume ccs-write-behind
 68:     type performance/write-behind
 69:     subvolumes ccs-dht
 70: end-volume
 71:
 72: volume ccs-disk-usage       --> we added a new xlator here for the write op,
                                     just to check whether the disk is full;
                                     it uses synctask_new for writes.
 73:     type performance/disk-usage
 74:     subvolumes ccs-write-behind
 75: end-volume
 76:
 77: volume ccs-read-ahead
 78:     type performance/read-ahead
 79:     subvolumes ccs-disk-usage
 80: end-volume

PS: part of our new translator code:

int
du_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
           struct iovec *vector, int count, off_t off, uint32_t flags,
           struct iobref *iobref, dict_t *xdata)
{
        int          op_errno = -1;
        int          ret      = -1;
        du_local_t  *local    = NULL;
        loc_t        tmp_loc  = {0,};

        VALIDATE_OR_GOTO (frame, err);
        VALIDATE_OR_GOTO (this, err);
        VALIDATE_OR_GOTO (fd, err);

        tmp_loc.gfid[15] = 1;
        tmp_loc.inode = fd->inode;
        tmp_loc.parent = fd->inode;

        local = du_local_init (frame, &tmp_loc, fd, GF_FOP_WRITE);
        if (!local) {
                op_errno = ENOMEM;
                goto err;
        }

        local->vector = iov_dup (vector, count);
        local->offset = off;
        local->count  = count;
        local->flags  = flags;
        local->iobref = iobref_ref (iobref);

        ret = synctask_new (this->ctx->env, du_get_du_info,
                            du_writev_resume, frame, frame);
        if (ret) {
                op_errno = -1;
                gf_log (this->name, GF_LOG_WARNING,
                        "synctask_new return failure ret(%d)", ret);
                goto err;
        }

        return 0;

err:
        op_errno = (op_errno == -1) ? errno : op_errno;
        DU_STACK_UNWIND (writev, frame, -1, op_errno, NULL, NULL, NULL);
        return 0;
}

Br,
Li Deqian
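P.S. Would it make sense, if the fd really is being released while du_get_du_info runs in the synctask, for the disk-usage xlator to pin the fd itself for the lifetime of the task? Below is a rough, untested sketch of that idea; it assumes du_local_t has (or gains) an fd member, and the helper names are made up for illustration.

/* Untested sketch: pin the fd across the synctask delay so a racing
 * RELEASE cannot destroy it before du_writev_resume() winds the write.
 * Assumes du_local_t carries an 'fd' member; helper names are
 * illustrative only, not existing glusterfs APIs beyond fd_ref/fd_unref. */

static void
du_local_pin_fd (du_local_t *local, fd_t *fd)
{
        /* called from du_writev(), before synctask_new() */
        local->fd = fd_ref (fd);
}

static void
du_local_unpin_fd (du_local_t *local)
{
        /* called once the resumed write has been wound and answered */
        if (local->fd) {
                fd_unref (local->fd);
                local->fd = NULL;
        }
}

Of course this only papers over the race if our speculation is right, so we would still like to understand the real root cause.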
Raghavendra Gowdappa
2019-Jan-30 02:59 UTC
[Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) <deqian.li at nokia-sbell.com> wrote:

>         ret = synctask_new (this->ctx->env, du_get_du_info,
>                             du_writev_resume, frame, frame);

Can you paste the code of:

* du_get_du_info
* du_writev_resume
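Just so we are talking about the same thing: what I'd expect a resume callback launched through synctask_new to do is simply wind the saved write down to the child xlator, roughly along the lines of the sketch below. This is only my guess at its shape, not your code (du_writev_cbk is a made-up name, and I'm assuming du_local_init() stashed the fd in local->fd), which is exactly why I'd like to see the real du_get_du_info and du_writev_resume.

/* Guess at the general shape of du_writev_resume() -- NOT the actual
 * disk-usage code.  du_writev_cbk is a hypothetical callback; the fd is
 * assumed to be stored in local->fd by du_local_init(). */
int
du_writev_resume (int ret, call_frame_t *frame, void *opaque)
{
        du_local_t *local = frame->local;
        xlator_t   *this  = frame->this;

        /* du_get_du_info ran in the synctask; now hand the write down
         * to the next xlator (write-behind in this graph). */
        STACK_WIND (frame, du_writev_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->writev,
                    local->fd, local->vector, local->count,
                    local->offset, local->flags, local->iobref,
                    NULL /* xdata is not preserved in the posted du_writev */);

        return 0;
}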
Raghavendra Gowdappa
2019-Jan-30 04:14 UTC
[Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump
On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) <deqian.li at nokia-sbell.com> wrote:

> After adding test logs I found that the FOP sequence on the write-behind
> xlator side was out of order, as shown below. On the FUSE side the FLUSH
> comes after write2, but on the WB side the FLUSH lands between write2's
> 'wb_do_unwinds' and 'wb_fulfill'. So I think this indicates a problem.

wb_do_unwinds unwinds the write after caching it. Once the write response is unwound, the kernel issues a FLUSH. So, this is a valid sequence of operations and nothing is wrong there.

> I think it's possible that the FLUSH and the later RELEASE operation
> destroy the fd, which would cause wb_inode->this to be
> 0xffffffffffffff00. Do you think so?
>
> And I think our newly added disk-usage xlator's synctask_new delays the
> write operation, while the FLUSH operation has no such delay (because it
> does not go through the disk-usage xlator).

Flush and release wait for completion of any on-going writes. So, unless write-behind has unwound the write, it won't see a flush. By the time the write is unwound, write-behind makes sure to take references on the objects it uses (like fds, iobrefs etc.). So, I don't see a problem there.

> Do you agree with my speculation? And how should we fix it? (We don't
> want to move the disk-usage xlator.)

I've still not found the RCA. We can discuss the fix once the RCA is found.
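To make the "takes references" point concrete: when write-behind queues a write, it pins the objects it will need later, conceptually like the sketch below. This is a simplified illustration, not the actual write-behind code (the struct and function names here are made up); the real logic lives in write-behind.c's request-handling path.

/* Simplified illustration -- not the real write-behind code.  The point
 * is that the queued request holds its own references, so a FLUSH or
 * RELEASE racing with the background write cannot free the fd or the
 * data buffers out from under wb_fulfill(). */

typedef struct {
        struct list_head  winds;   /* hangs off the inode's liability list */
        fd_t             *fd;      /* pinned with fd_ref()                 */
        struct iobref    *iobref;  /* pinned with iobref_ref()             */
} wb_request_sketch_t;

static wb_request_sketch_t *
wb_request_sketch_new (fd_t *fd, struct iobref *iobref)
{
        wb_request_sketch_t *req = NULL;

        req = GF_CALLOC (1, sizeof (*req), gf_common_mt_char);
        if (!req)
                return NULL;

        INIT_LIST_HEAD (&req->winds);
        req->fd     = fd_ref (fd);          /* survives a racing RELEASE */
        req->iobref = iobref_ref (iobref);  /* keeps the payload alive   */

        return req;
}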