The newly introduced "SEEK" fop seems to be failing at the bricks. Adding Niels for his inputs/help. -Krutika On Mon, May 8, 2017 at 3:43 PM, Alessandro Briosi <ab1 at metalit.com> wrote:> Hi all, > I have sporadic VM going down which files are on gluster FS. > > If I look at the gluster logs the only events I find are: > /var/log/glusterfs/bricks/data-brick2-brick.log > > [2017-05-08 09:51:17.661697] I [MSGID: 115036] > [server.c:548:server_rpc_notify] 0-datastore2-server: disconnecting > connection from > srvpve2-9074-2017/05/04-14:12:53:301448-datastore2-client-0-0-0 > [2017-05-08 09:51:17.661697] I [MSGID: 115036] > [server.c:548:server_rpc_notify] 0-datastore2-server: disconnecting > connection from > srvpve2-9074-2017/05/04-14:12:53:367950-datastore2-client-0-0-0 > [2017-05-08 09:51:17.661810] W [inodelk.c:399:pl_inodelk_log_cleanup] > 0-datastore2-server: releasing lock on > 66d9eefb-ee55-40ad-9f44-c55d1e809006 held by {client=0x7f4c7c004880, > pid=0 lk-owner=5c7099efc97f0000} > [2017-05-08 09:51:17.661810] W [inodelk.c:399:pl_inodelk_log_cleanup] > 0-datastore2-server: releasing lock on > a8d82b3d-1cf9-45cf-9858-d8546710b49c held by {client=0x7f4c840f31d0, > pid=0 lk-owner=5c7019fac97f0000} > [2017-05-08 09:51:17.661835] I [MSGID: 115013] > [server-helpers.c:293:do_fd_cleanup] 0-datastore2-server: fd cleanup on > /images/201/vm-201-disk-2.qcow2 > [2017-05-08 09:51:17.661838] I [MSGID: 115013] > [server-helpers.c:293:do_fd_cleanup] 0-datastore2-server: fd cleanup on > /images/201/vm-201-disk-1.qcow2 > [2017-05-08 09:51:17.661953] I [MSGID: 101055] > [client_t.c:415:gf_client_unref] 0-datastore2-server: Shutting down > connection srvpve2-9074-2017/05/04-14:12:53:301448-datastore2-client-0-0-0 > [2017-05-08 09:51:17.661953] I [MSGID: 101055] > [client_t.c:415:gf_client_unref] 0-datastore2-server: Shutting down > connection srvpve2-9074-2017/05/04-14:12:53:367950-datastore2-client-0-0-0 > [2017-05-08 10:01:06.210392] I [MSGID: 115029] > [server-handshake.c:692:server_setvolume] 0-datastore2-server: accepted > client from > srvpve2-162483-2017/05/08-10:01:06:189720-datastore2-client-0-0-0 > (version: 3.8.11) > [2017-05-08 10:01:06.237433] E [MSGID: 113107] [posix.c:1079:posix_seek] > 0-datastore2-posix: seek failed on fd 18 length 42957209600 [No such > device or address] > [2017-05-08 10:01:06.237463] E [MSGID: 115089] > [server-rpc-fops.c:2007:server_seek_cbk] 0-datastore2-server: 18: SEEK-2 > (a8d82b3d-1cf9-45cf-9858-d8546710b49c) ==> (No such device or address) > [No such device or address] > [2017-05-08 10:01:07.019974] I [MSGID: 115029] > [server-handshake.c:692:server_setvolume] 0-datastore2-server: accepted > client from > srvpve2-162483-2017/05/08-10:01:07:3687-datastore2-client-0-0-0 > (version: 3.8.11) > [2017-05-08 10:01:07.041967] E [MSGID: 113107] [posix.c:1079:posix_seek] > 0-datastore2-posix: seek failed on fd 19 length 859136720896 [No such > device or address] > [2017-05-08 10:01:07.041992] E [MSGID: 115089] > [server-rpc-fops.c:2007:server_seek_cbk] 0-datastore2-server: 18: SEEK-2 > (66d9eefb-ee55-40ad-9f44-c55d1e809006) ==> (No such device or address) > [No such device or address] > > The strange part is that I cannot seem to find any other error. > If I restart the VM everything works as expected (it stopped at ~9.51 > UTC and was started at ~10.01 UTC) . > > This is not the first time that this happened, and I do not see any > problems with networking or the hosts. 
> > Gluster version is 3.8.11 > this is the incriminated volume (though it happened on a different one too) > > Volume Name: datastore2 > Type: Replicate > Volume ID: c95ebb5f-6e04-4f09-91b9-bbbe63d83aea > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: srvpve2g:/data/brick2/brick > Brick2: srvpve3g:/data/brick2/brick > Brick3: srvpve1g:/data/brick2/brick (arbiter) > Options Reconfigured: > nfs.disable: on > performance.readdir-ahead: on > transport.address-family: inet > > Any hint on how to dig more deeply into the reason would be greatly > appreciated. > > Alessandro > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > http://lists.gluster.org/mailman/listinfo/gluster-users >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20170508/5fcac5a8/attachment.html>
I don't know if this has any relation to your issue, but I have seen several times during Gluster healing that my VMs fail or are marked unresponsive in RHEV. My conclusion is that the load Gluster puts on the VM images during checksumming while healing results in too much latency, and the VMs fail.

My plan is to try sharding, so the VM image files are split into smaller files, to change the number of allowed concurrent heals ("cluster.background-self-heal-count"), and to disable "cluster.self-heal-daemon".

/Jesper

From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On behalf of Krutika Dhananjay
Sent: 8 May 2017 12:38
To: Alessandro Briosi <ab1 at metalit.com>; de Vos, Niels <ndevos at redhat.com>
Cc: gluster-users <gluster-users at gluster.org>
Subject: Re: [Gluster-users] VM going down
On 08/05/2017 12:38, Krutika Dhananjay wrote:
> The newly introduced "SEEK" fop seems to be failing at the bricks.
>
> Adding Niels for his inputs/help.

I don't know if this is related, though: the SEEK is done only when the VM is started, not when it is suddenly shut down. And although it's an odd message (the file really is there), the VM starts correctly.

Alessandro
...
> > client from srvpve2-162483-2017/05/08-10:01:06:189720-datastore2-client-0-0-0 (version: 3.8.11)
> > [2017-05-08 10:01:06.237433] E [MSGID: 113107] [posix.c:1079:posix_seek] 0-datastore2-posix: seek failed on fd 18 length 42957209600 [No such device or address]

The SEEK procedure translates to lseek() in the posix xlator. This can return with "No such device or address" (ENXIO) in only one case:

    ENXIO   whence is SEEK_DATA or SEEK_HOLE, and the file offset is
            beyond the end of the file.

This means that an lseek() was executed where the current offset of the file descriptor was higher than the size of the file. I'm not sure how that could happen... Sharding prevents using SEEK at all at the moment.

...
> > The strange part is that I cannot seem to find any other error.
> > If I restart the VM everything works as expected (it stopped at ~9:51 UTC and was started at ~10:01 UTC).
> >
> > This is not the first time this has happened, and I do not see any problems with networking or the hosts.
> >
> > Gluster version is 3.8.11.
> > This is the incriminated volume (though it happened on a different one too):
> >
> > Volume Name: datastore2
> > Type: Replicate
> > Volume ID: c95ebb5f-6e04-4f09-91b9-bbbe63d83aea
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 1 x (2 + 1) = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: srvpve2g:/data/brick2/brick
> > Brick2: srvpve3g:/data/brick2/brick
> > Brick3: srvpve1g:/data/brick2/brick (arbiter)
> > Options Reconfigured:
> > nfs.disable: on
> > performance.readdir-ahead: on
> > transport.address-family: inet
> >
> > Any hint on how to dig more deeply into the reason would be greatly appreciated.

Probably the problem is with SEEK support in the arbiter functionality. Just like with a READ or a WRITE on the arbiter brick, SEEK can only succeed on bricks where the files with content are located. It does not look like arbiter handles SEEK, so the offset in lseek() will likely be higher than the size of the file on the brick (an empty, 0-size file). I don't know how the replication xlator responds to an error return from SEEK on one of the bricks, but I doubt it likes it.

We have https://bugzilla.redhat.com/show_bug.cgi?id=1301647 to support SEEK for sharding. I suggest you open a bug for getting SEEK in the arbiter xlator as well.

HTH,
Niels
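[Editor's note] A minimal C sketch of the ENXIO case Niels describes, not taken from the thread: the file name is hypothetical and the offset is simply the one from the brick log above. On a 0-byte file (like the arbiter's copy of the image), any SEEK_DATA offset lies beyond the end of the file, so lseek() fails with "No such device or address". Behaviour assumes Linux/glibc with SEEK_DATA support.

#define _GNU_SOURCE            /* exposes SEEK_DATA/SEEK_HOLE on glibc */
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t on 32-bit builds */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical empty file standing in for the arbiter's 0-byte image copy. */
    int fd = open("arbiter-copy.qcow2", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Offset the client asked the brick to SEEK to (taken from the log). */
    off_t offset = 42957209600LL;

    off_t ret = lseek(fd, offset, SEEK_DATA);
    if (ret == (off_t)-1) {
        /* Expected here: errno == ENXIO, "No such device or address" */
        fprintf(stderr, "lseek(SEEK_DATA) failed: %s (errno=%d)\n",
                strerror(errno), errno);
    } else {
        printf("data found at offset %lld\n", (long long)ret);
    }

    close(fd);
    return 0;
}

On a brick holding the real, fully sized image the same call would succeed (or report the next data region), which is why only the arbiter brick would be expected to log the posix_seek error.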