The newly introduced "SEEK" fop seems to be failing at the bricks.

Adding Niels for his inputs/help.

-Krutika

On Mon, May 8, 2017 at 3:43 PM, Alessandro Briosi <ab1 at metalit.com> wrote:

> Hi all,
> I have sporadic VM going down which files are on gluster FS.
>
> If I look at the gluster logs the only events I find are:
> /var/log/glusterfs/bricks/data-brick2-brick.log
>
> [2017-05-08 09:51:17.661697] I [MSGID: 115036]
> [server.c:548:server_rpc_notify] 0-datastore2-server: disconnecting
> connection from
> srvpve2-9074-2017/05/04-14:12:53:301448-datastore2-client-0-0-0
> [2017-05-08 09:51:17.661697] I [MSGID: 115036]
> [server.c:548:server_rpc_notify] 0-datastore2-server: disconnecting
> connection from
> srvpve2-9074-2017/05/04-14:12:53:367950-datastore2-client-0-0-0
> [2017-05-08 09:51:17.661810] W [inodelk.c:399:pl_inodelk_log_cleanup]
> 0-datastore2-server: releasing lock on
> 66d9eefb-ee55-40ad-9f44-c55d1e809006 held by {client=0x7f4c7c004880,
> pid=0 lk-owner=5c7099efc97f0000}
> [2017-05-08 09:51:17.661810] W [inodelk.c:399:pl_inodelk_log_cleanup]
> 0-datastore2-server: releasing lock on
> a8d82b3d-1cf9-45cf-9858-d8546710b49c held by {client=0x7f4c840f31d0,
> pid=0 lk-owner=5c7019fac97f0000}
> [2017-05-08 09:51:17.661835] I [MSGID: 115013]
> [server-helpers.c:293:do_fd_cleanup] 0-datastore2-server: fd cleanup on
> /images/201/vm-201-disk-2.qcow2
> [2017-05-08 09:51:17.661838] I [MSGID: 115013]
> [server-helpers.c:293:do_fd_cleanup] 0-datastore2-server: fd cleanup on
> /images/201/vm-201-disk-1.qcow2
> [2017-05-08 09:51:17.661953] I [MSGID: 101055]
> [client_t.c:415:gf_client_unref] 0-datastore2-server: Shutting down
> connection srvpve2-9074-2017/05/04-14:12:53:301448-datastore2-client-0-0-0
> [2017-05-08 09:51:17.661953] I [MSGID: 101055]
> [client_t.c:415:gf_client_unref] 0-datastore2-server: Shutting down
> connection srvpve2-9074-2017/05/04-14:12:53:367950-datastore2-client-0-0-0
> [2017-05-08 10:01:06.210392] I [MSGID: 115029]
> [server-handshake.c:692:server_setvolume] 0-datastore2-server: accepted
> client from
> srvpve2-162483-2017/05/08-10:01:06:189720-datastore2-client-0-0-0
> (version: 3.8.11)
> [2017-05-08 10:01:06.237433] E [MSGID: 113107] [posix.c:1079:posix_seek]
> 0-datastore2-posix: seek failed on fd 18 length 42957209600 [No such
> device or address]
> [2017-05-08 10:01:06.237463] E [MSGID: 115089]
> [server-rpc-fops.c:2007:server_seek_cbk] 0-datastore2-server: 18: SEEK-2
> (a8d82b3d-1cf9-45cf-9858-d8546710b49c) ==> (No such device or address)
> [No such device or address]
> [2017-05-08 10:01:07.019974] I [MSGID: 115029]
> [server-handshake.c:692:server_setvolume] 0-datastore2-server: accepted
> client from
> srvpve2-162483-2017/05/08-10:01:07:3687-datastore2-client-0-0-0
> (version: 3.8.11)
> [2017-05-08 10:01:07.041967] E [MSGID: 113107] [posix.c:1079:posix_seek]
> 0-datastore2-posix: seek failed on fd 19 length 859136720896 [No such
> device or address]
> [2017-05-08 10:01:07.041992] E [MSGID: 115089]
> [server-rpc-fops.c:2007:server_seek_cbk] 0-datastore2-server: 18: SEEK-2
> (66d9eefb-ee55-40ad-9f44-c55d1e809006) ==> (No such device or address)
> [No such device or address]
>
> The strange part is that I cannot seem to find any other error.
> If I restart the VM everything works as expected (it stopped at ~9.51
> UTC and was started at ~10.01 UTC).
>
> This is not the first time that this happened, and I do not see any
> problems with networking or the hosts.
>
> Gluster version is 3.8.11
> this is the incriminated volume (though it happened on a different one too)
>
> Volume Name: datastore2
> Type: Replicate
> Volume ID: c95ebb5f-6e04-4f09-91b9-bbbe63d83aea
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: srvpve2g:/data/brick2/brick
> Brick2: srvpve3g:/data/brick2/brick
> Brick3: srvpve1g:/data/brick2/brick (arbiter)
> Options Reconfigured:
> nfs.disable: on
> performance.readdir-ahead: on
> transport.address-family: inet
>
> Any hint on how to dig more deeply into the reason would be greatly
> appreciated.
>
> Alessandro
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
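For digging further into a brick log like the one quoted above, a small filter can pull out just the warning and error entries. The following is a minimal sketch, not a Gluster tool: the regular expression is inferred only from the entries shown in this thread, it assumes each entry sits on a single line in the real log file (the mail client wrapped them here), and the names `LOG_LINE`, `parse_line` and `by_level` are made up for illustration.

```python
import re

# Layout inferred from the excerpt in this thread:
# [timestamp] LEVEL [MSGID: n] [file:line:function] domain: message
# The MSGID field is optional (the "W" lock-cleanup entries lack it).
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<source>[^\]]+)\] (?P<domain>\S+): (?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def by_level(lines, levels=("E", "W")):
    """Keep only the parsed entries whose level is in `levels`."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] in levels]


# Three entries copied from the thread and joined back onto single lines.
SAMPLE = [
    "[2017-05-08 09:51:17.661697] I [MSGID: 115036] "
    "[server.c:548:server_rpc_notify] 0-datastore2-server: disconnecting "
    "connection from "
    "srvpve2-9074-2017/05/04-14:12:53:301448-datastore2-client-0-0-0",
    "[2017-05-08 09:51:17.661810] W [inodelk.c:399:pl_inodelk_log_cleanup] "
    "0-datastore2-server: releasing lock on "
    "66d9eefb-ee55-40ad-9f44-c55d1e809006 held by {client=0x7f4c7c004880, "
    "pid=0 lk-owner=5c7099efc97f0000}",
    "[2017-05-08 10:01:06.237433] E [MSGID: 113107] [posix.c:1079:posix_seek] "
    "0-datastore2-posix: seek failed on fd 18 length 42957209600 "
    "[No such device or address]",
]

hits = by_level(SAMPLE)
for entry in hits:
    print(entry["ts"], entry["level"], entry["source"], entry["message"])
```

Running the same filter over the client-side (mount or qemu) log for the minutes before 09:51 may show what preceded the disconnect, since the brick log only records its effects.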
I don't know if this has any relation to your issue, but I have seen several
times during gluster healing that my VMs fail or are marked unresponsive in
RHEV. My conclusion is that the load gluster puts on the VM images during
checksumming while healing results in too much latency, and the VMs fail.
My plan is to try using sharding, so that the VM images/files are split into
smaller files, changing the number of allowed concurrent heals
("cluster.background-self-heal-count"), and disabling
"cluster.self-heal-daemon".
/Jesper
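The three changes Jesper mentions map onto `gluster volume set` calls. The sketch below only builds and prints the commands; it does not run them and is not a recommendation. The volume name `vmstore` and the heal count `4` are placeholders, and `features.shard` is assumed to be the option he means by "sharding".

```python
def heal_tuning_commands(volume, background_heals):
    """Build (but do not execute) the `gluster volume set` commands for the plan above."""
    settings = [
        # Assumption: "using sharding" means turning on the shard translator.
        ("features.shard", "on"),
        # Number of heals allowed to run in the background at once.
        ("cluster.background-self-heal-count", str(background_heals)),
        # Turn the self-heal daemon off, as in Jesper's plan.
        ("cluster.self-heal-daemon", "off"),
    ]
    return [
        ["gluster", "volume", "set", volume, key, value]
        for key, value in settings
    ]


# Placeholder volume name and heal count, for illustration only.
commands = heal_tuning_commands("vmstore", 4)
for argv in commands:
    print(" ".join(argv))
```

Keeping this as a dry run is deliberate: as far as I know, sharding only applies to files created after it is enabled, so it is worth trying on a test volume before touching one that holds live VM images.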
From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at
gluster.org] On behalf of Krutika Dhananjay
Sent: 8 May 2017 12:38
To: Alessandro Briosi <ab1 at metalit.com>; de Vos, Niels <ndevos at
redhat.com>
Cc: gluster-users <gluster-users at gluster.org>
Subject: Re: [Gluster-users] VM going down
The newly introduced "SEEK" fop seems to be failing at the bricks.
Adding Niels for his inputs/help.
-Krutika
On 08/05/2017 12:38, Krutika Dhananjay wrote:
> The newly introduced "SEEK" fop seems to be failing at the bricks.
>
> Adding Niels for his inputs/help.

I don't know if this is related, though: the SEEK is done only when the VM
is started, not when it suddenly shuts down. It's an odd message (the file
really is there), but the VM starts correctly.

Alessandro
...
> > client from
> > srvpve2-162483-2017/05/08-10:01:06:189720-datastore2-client-0-0-0
> > (version: 3.8.11)
> > [2017-05-08 10:01:06.237433] E [MSGID: 113107] [posix.c:1079:posix_seek]
> > 0-datastore2-posix: seek failed on fd 18 length 42957209600 [No such
> > device or address]

The SEEK procedure translates to lseek() in the posix xlator. This can
return "No such device or address" (ENXIO) in only one case:

    ENXIO  whence is SEEK_DATA or SEEK_HOLE, and the file offset is
           beyond the end of the file.

This means that an lseek() was executed where the current offset of the
file descriptor was higher than the size of the file. I'm not sure how
that could happen... Sharding prevents using SEEK at all atm.

...
> > The strange part is that I cannot seem to find any other error.
> > If I restart the VM everything works as expected (it stopped at ~9.51
> > UTC and was started at ~10.01 UTC) .
> >
> > This is not the first time that this happened, and I do not see any
> > problems with networking or the hosts.
> >
> > Gluster version is 3.8.11
> > this is the incriminated volume (though it happened on a different one too)
> >
> > Volume Name: datastore2
> > Type: Replicate
> > Volume ID: c95ebb5f-6e04-4f09-91b9-bbbe63d83aea
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 1 x (2 + 1) = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: srvpve2g:/data/brick2/brick
> > Brick2: srvpve3g:/data/brick2/brick
> > Brick3: srvpve1g:/data/brick2/brick (arbiter)
> > Options Reconfigured:
> > nfs.disable: on
> > performance.readdir-ahead: on
> > transport.address-family: inet
> >
> > Any hint on how to dig more deeply into the reason would be greatly
> > appreciated.

Probably the problem is with SEEK support in the arbiter functionality.
Just like with a READ or a WRITE on the arbiter brick, SEEK can only
succeed on bricks where the files with content are located. It does not
look like arbiter handles SEEK, so the offset in lseek() will likely be
higher than the size of the file on the brick (empty, 0 size file). I
don't know how the replication xlator responds on an error return from
SEEK on one of the bricks, but I doubt it likes it.

We have https://bugzilla.redhat.com/show_bug.cgi?id=1301647 to support
SEEK for sharding. I suggest you open a bug for getting SEEK in the
arbiter xlator as well.

HTH,
Niels
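The ENXIO case Niels describes can be reproduced outside Gluster with plain lseek(). This is a minimal sketch assuming Linux and Python 3, where `os.SEEK_DATA` is available; the helper `seek_data_errno` and the file names are made up for the illustration, and the empty file only stands in for the zero-size file he expects on the arbiter brick.

```python
import errno
import os
import tempfile


def seek_data_errno(path, offset):
    """Return None if lseek(fd, offset, SEEK_DATA) succeeds, else the errno it failed with."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_DATA)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    data_file = os.path.join(tmp, "with-data.img")
    empty_file = os.path.join(tmp, "empty.img")

    # A 4 KiB file with real content, like an image on a data brick.
    with open(data_file, "wb") as handle:
        handle.write(b"x" * 4096)
    # A zero-size file, like the placeholder assumed on the arbiter brick.
    open(empty_file, "wb").close()

    in_range = seek_data_errno(data_file, 0)     # offset inside the file
    past_eof = seek_data_errno(data_file, 8192)  # offset beyond the end
    on_empty = seek_data_errno(empty_file, 0)    # any offset on an empty file

for label, result in (("in range", in_range),
                      ("past EOF", past_eof),
                      ("empty file", on_empty)):
    print(label, "ok" if result is None else os.strerror(result))
```

On an empty file every offset is at or past the end, so even a SEEK_DATA from offset 0 fails. That fits the explanation above, where a SEEK reaching a brick that holds only a zero-size file returns ENXIO.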