The newly introduced "SEEK" fop seems to be failing at the bricks. Adding Niels for his inputs/help. -Krutika On Mon, May 8, 2017 at 3:43 PM, Alessandro Briosi <ab1 at metalit.com> wrote:> Hi all, > I have sporadic VM going down which files are on gluster FS. > > If I look at the gluster logs the only events I find are: > /var/log/glusterfs/bricks/data-brick2-brick.log > > [2017-05-08 09:51:17.661697] I [MSGID: 115036] > [server.c:548:server_rpc_notify] 0-datastore2-server: disconnecting > connection from > srvpve2-9074-2017/05/04-14:12:53:301448-datastore2-client-0-0-0 > [2017-05-08 09:51:17.661697] I [MSGID: 115036] > [server.c:548:server_rpc_notify] 0-datastore2-server: disconnecting > connection from > srvpve2-9074-2017/05/04-14:12:53:367950-datastore2-client-0-0-0 > [2017-05-08 09:51:17.661810] W [inodelk.c:399:pl_inodelk_log_cleanup] > 0-datastore2-server: releasing lock on > 66d9eefb-ee55-40ad-9f44-c55d1e809006 held by {client=0x7f4c7c004880, > pid=0 lk-owner=5c7099efc97f0000} > [2017-05-08 09:51:17.661810] W [inodelk.c:399:pl_inodelk_log_cleanup] > 0-datastore2-server: releasing lock on > a8d82b3d-1cf9-45cf-9858-d8546710b49c held by {client=0x7f4c840f31d0, > pid=0 lk-owner=5c7019fac97f0000} > [2017-05-08 09:51:17.661835] I [MSGID: 115013] > [server-helpers.c:293:do_fd_cleanup] 0-datastore2-server: fd cleanup on > /images/201/vm-201-disk-2.qcow2 > [2017-05-08 09:51:17.661838] I [MSGID: 115013] > [server-helpers.c:293:do_fd_cleanup] 0-datastore2-server: fd cleanup on > /images/201/vm-201-disk-1.qcow2 > [2017-05-08 09:51:17.661953] I [MSGID: 101055] > [client_t.c:415:gf_client_unref] 0-datastore2-server: Shutting down > connection srvpve2-9074-2017/05/04-14:12:53:301448-datastore2-client-0-0-0 > [2017-05-08 09:51:17.661953] I [MSGID: 101055] > [client_t.c:415:gf_client_unref] 0-datastore2-server: Shutting down > connection srvpve2-9074-2017/05/04-14:12:53:367950-datastore2-client-0-0-0 > [2017-05-08 10:01:06.210392] I [MSGID: 115029] > [server-handshake.c:692:server_setvolume] 0-datastore2-server: accepted > client from > srvpve2-162483-2017/05/08-10:01:06:189720-datastore2-client-0-0-0 > (version: 3.8.11) > [2017-05-08 10:01:06.237433] E [MSGID: 113107] [posix.c:1079:posix_seek] > 0-datastore2-posix: seek failed on fd 18 length 42957209600 [No such > device or address] > [2017-05-08 10:01:06.237463] E [MSGID: 115089] > [server-rpc-fops.c:2007:server_seek_cbk] 0-datastore2-server: 18: SEEK-2 > (a8d82b3d-1cf9-45cf-9858-d8546710b49c) ==> (No such device or address) > [No such device or address] > [2017-05-08 10:01:07.019974] I [MSGID: 115029] > [server-handshake.c:692:server_setvolume] 0-datastore2-server: accepted > client from > srvpve2-162483-2017/05/08-10:01:07:3687-datastore2-client-0-0-0 > (version: 3.8.11) > [2017-05-08 10:01:07.041967] E [MSGID: 113107] [posix.c:1079:posix_seek] > 0-datastore2-posix: seek failed on fd 19 length 859136720896 [No such > device or address] > [2017-05-08 10:01:07.041992] E [MSGID: 115089] > [server-rpc-fops.c:2007:server_seek_cbk] 0-datastore2-server: 18: SEEK-2 > (66d9eefb-ee55-40ad-9f44-c55d1e809006) ==> (No such device or address) > [No such device or address] > > The strange part is that I cannot seem to find any other error. > If I restart the VM everything works as expected (it stopped at ~9.51 > UTC and was started at ~10.01 UTC) . > > This is not the first time that this happened, and I do not see any > problems with networking or the hosts. 
> > Gluster version is 3.8.11 > this is the incriminated volume (though it happened on a different one too) > > Volume Name: datastore2 > Type: Replicate > Volume ID: c95ebb5f-6e04-4f09-91b9-bbbe63d83aea > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: srvpve2g:/data/brick2/brick > Brick2: srvpve3g:/data/brick2/brick > Brick3: srvpve1g:/data/brick2/brick (arbiter) > Options Reconfigured: > nfs.disable: on > performance.readdir-ahead: on > transport.address-family: inet > > Any hint on how to dig more deeply into the reason would be greatly > appreciated. > > Alessandro > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > http://lists.gluster.org/mailman/listinfo/gluster-users >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20170508/5fcac5a8/attachment.html>
I don't know if this has any relation to your issue, but I have seen several times during Gluster healing that my VMs fail or are marked unresponsive in RHEV. My conclusion is that the load Gluster puts on the VM images during checksumming while healing results in too much latency, and the VMs fail.

My plan is to try sharding, so the VM image files are split into smaller files, to change the number of allowed concurrent heals ("cluster.background-self-heal-count"), and to disable "cluster.self-heal-daemon".

/Jesper

From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On behalf of Krutika Dhananjay
Sent: 8 May 2017 12:38
To: Alessandro Briosi <ab1 at metalit.com>; de Vos, Niels <ndevos at redhat.com>
Cc: gluster-users <gluster-users at gluster.org>
Subject: Re: [Gluster-users] VM going down
On 08/05/2017 12:38, Krutika Dhananjay wrote:
> The newly introduced "SEEK" fop seems to be failing at the bricks.
>
> Adding Niels for his inputs/help.

I don't know if this is related, though: the SEEK is done only when the VM is started, not when it is suddenly shut down. And although it's an odd message (the file really is there), the VM starts correctly.

Alessandro
...
> > client from srvpve2-162483-2017/05/08-10:01:06:189720-datastore2-client-0-0-0 (version: 3.8.11)
> > [2017-05-08 10:01:06.237433] E [MSGID: 113107] [posix.c:1079:posix_seek] 0-datastore2-posix: seek failed on fd 18 length 42957209600 [No such device or address]

The SEEK procedure translates to lseek() in the posix xlator. This can return with "No such device or address" (ENXIO) in only one case:

    ENXIO   whence is SEEK_DATA or SEEK_HOLE, and the file offset is
            beyond the end of the file.

This means that an lseek() was executed where the current offset of the file descriptor was higher than the size of the file. I'm not sure how that could happen... Sharding prevents using SEEK at all at the moment.

...
> > The strange part is that I cannot seem to find any other error.
> > If I restart the VM everything works as expected (it stopped at ~9:51 UTC and was started at ~10:01 UTC).
> >
> > This is not the first time this has happened, and I do not see any problems with networking or the hosts.
> >
> > Gluster version is 3.8.11.
> > This is the incriminated volume (though it happened on a different one too):
> >
> > Volume Name: datastore2
> > Type: Replicate
> > Volume ID: c95ebb5f-6e04-4f09-91b9-bbbe63d83aea
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 1 x (2 + 1) = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: srvpve2g:/data/brick2/brick
> > Brick2: srvpve3g:/data/brick2/brick
> > Brick3: srvpve1g:/data/brick2/brick (arbiter)
> > Options Reconfigured:
> > nfs.disable: on
> > performance.readdir-ahead: on
> > transport.address-family: inet
> >
> > Any hint on how to dig more deeply into the reason would be greatly appreciated.

Probably the problem is with SEEK support in the arbiter functionality. Just like with a READ or a WRITE on the arbiter brick, SEEK can only succeed on bricks where the files with content are located. It does not look like arbiter handles SEEK, so the offset in lseek() will likely be higher than the size of the file on the brick (an empty, 0-size file). I don't know how the replication xlator responds to an error return from SEEK on one of the bricks, but I doubt it likes it.

We have https://bugzilla.redhat.com/show_bug.cgi?id=1301647 to support SEEK for sharding. I suggest you open a bug for getting SEEK in the arbiter xlator as well.

HTH,
Niels
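[Editor's note] A minimal C sketch of the ENXIO case Niels describes, not taken from the thread: the file name is hypothetical and the offset is simply the one from the brick log above. On a 0-byte file (like the arbiter's copy of the image), any SEEK_DATA offset lies beyond the end of the file, so lseek() fails with "No such device or address". Behaviour assumes Linux/glibc with SEEK_DATA support.

#define _GNU_SOURCE            /* exposes SEEK_DATA/SEEK_HOLE on glibc */
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t on 32-bit builds */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical empty file standing in for the arbiter's 0-byte image copy. */
    int fd = open("arbiter-copy.qcow2", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Offset the client asked the brick to SEEK to (taken from the log). */
    off_t offset = 42957209600LL;

    off_t ret = lseek(fd, offset, SEEK_DATA);
    if (ret == (off_t)-1) {
        /* Expected here: errno == ENXIO, "No such device or address" */
        fprintf(stderr, "lseek(SEEK_DATA) failed: %s (errno=%d)\n",
                strerror(errno), errno);
    } else {
        printf("data found at offset %lld\n", (long long)ret);
    }

    close(fd);
    return 0;
}

On a brick holding the real, fully sized image the same call would succeed (or report the next data region), which is why only the arbiter brick would be expected to log the posix_seek error.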