brandon at thinkhuge.net
2019-Mar-18 16:46 UTC
[Gluster-users] Transport endpoint is not connected failures in 5.3 under high I/O load
Hello list,

We are having critical failures under load on CentOS 7 glusterfs 5.3, with our servers losing their local mount point with the error "Transport endpoint is not connected".

Not sure if it is related, but the logs are full of the following message:

[2019-03-18 14:00:02.656876] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler

We operate multiple separate glusterfs distributed clusters of about 6-8 nodes. Our two biggest, separate, and most I/O-active glusterfs clusters are both having the issue.

We are trying to use glusterfs as a unified file system for pureftpd backup services for a VPS service. We have a relatively small backup window over the weekend when all our servers back up at the same time. When backups start early on Saturday, they cause massive sustained FTP upload I/O for around 48 hours while all the compressed backup files are uploaded. For our London 8-node cluster, for example, that is currently about 45 TB of uploads in ~48 hours.

We do have some other smaller issues with directory listing under this load, but the setup had been working for a couple of years since 3.x. Since we updated recently, all servers randomly lose their glusterfs mount with the "Transport endpoint is not connected" error.

Our glusterfs servers are all mostly the same, with small variations. Mostly they are Supermicro machines with an E3 CPU, 16 GB RAM, and LSI RAID10 HDD arrays (with and without BBU). Arrays vary between 4 and 16 SATA3 HDDs per node depending on the age of the server. Firmware is kept up to date, and we run the latest LSI compiled driver. The newer 16-drive backup servers also have 2 x 1 Gbit LACP-teamed interfaces.

[root at lonbaknode3 ~]# uname -r
3.10.0-957.5.1.el7.x86_64

[root at lonbaknode3 ~]# rpm -qa |grep gluster
centos-release-gluster5-1.0-1.el7.centos.noarch
glusterfs-libs-5.3-2.el7.x86_64
glusterfs-api-5.3-2.el7.x86_64
glusterfs-5.3-2.el7.x86_64
glusterfs-cli-5.3-2.el7.x86_64
glusterfs-client-xlators-5.3-2.el7.x86_64
glusterfs-server-5.3-2.el7.x86_64
glusterfs-fuse-5.3-2.el7.x86_64
[root at lonbaknode3 ~]#

[root at lonbaknode3 ~]# gluster volume info all

Volume Name: volbackups
Type: Distribute
Volume ID: 32bf4fe9-5450-49f8-b6aa-05471d3bdffa
Status: Started
Snapshot Count: 0
Number of Bricks: 8
Transport-type: tcp
Bricks:
Brick1: lonbaknode3.domain.net:/lvbackups/brick
Brick2: lonbaknode4.domain.net:/lvbackups/brick
Brick3: lonbaknode5.domain.net:/lvbackups/brick
Brick4: lonbaknode6.domain.net:/lvbackups/brick
Brick5: lonbaknode7.domain.net:/lvbackups/brick
Brick6: lonbaknode8.domain.net:/lvbackups/brick
Brick7: lonbaknode9.domain.net:/lvbackups/brick
Brick8: lonbaknode10.domain.net:/lvbackups/brick
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.min-free-disk: 1%
performance.cache-size: 8GB
performance.cache-max-file-size: 128MB
diagnostics.brick-log-level: WARNING
diagnostics.brick-sys-log-level: WARNING
client.event-threads: 3
performance.client-io-threads: on
performance.io-thread-count: 24
network.inode-lru-limit: 1048576
performance.parallel-readdir: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
[root at lonbaknode3 ~]#

Mount output shows the following:

lonbaknode3.domain.net:/volbackups on /home/volbackups type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

If you notice anything missing or otherwise wrong in our volume or mount settings above, feel free to let us know; we are still learning glusterfs. I tried searching for recommended performance settings, but it's not always clear which setting is most applicable or beneficial to our workload.

I have just found this post, which looks like the same issue:

https://lists.gluster.org/pipermail/gluster-users/2019-March/035958.html

We have not yet tried the suggestion of "performance.write-behind: off", but we will do so if that is recommended.

Could someone knowledgeable advise anything for these issues? If any more information is needed, do let us know.
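For readers hitting the same symptoms, a rough sketch of the two actions discussed above using standard gluster and mount commands. It assumes the volume name (volbackups), server hostname (lonbaknode3.domain.net), and mount point (/home/volbackups) shown in the output above; verify each option against the GlusterFS documentation before applying it to a production volume.

# Apply the write-behind suggestion from the linked thread (settable online from any server node):
[root at lonbaknode3 ~]# gluster volume set volbackups performance.write-behind off

# Confirm the option is now part of the volume configuration:
[root at lonbaknode3 ~]# gluster volume get volbackups performance.write-behind

# On a client whose FUSE mount has failed with "Transport endpoint is not connected",
# the stale mount usually has to be lazily unmounted before it can be remounted:
[root at lonbaknode3 ~]# umount -l /home/volbackups
[root at lonbaknode3 ~]# mount -t glusterfs lonbaknode3.domain.net:/volbackups /home/volbackups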
Amar Tumballi Suryanarayan
2019-Mar-20 05:27 UTC
[Gluster-users] Transport endpoint is not connected failures in 5.3 under high I/O load
Hi Brandon,

There were a few concerns raised about 5.3 recently, and we fixed some of them in 5.5 (5.4 had an upgrade issue, so 5.5 is the recommended upgrade version). Can you please upgrade to 5.5?

-Amar
--
Amar Tumballi (amarts)
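For completeness, a minimal sketch of the suggested upgrade on one of the CentOS 7 nodes above, assuming the already-installed centos-release-gluster5 repository provides the 5.5 packages. The order of operations (stop services, update packages, restart) follows the generic minor-release server upgrade steps and should be checked against the official 5.5 release notes before use.

# Stop gluster services on the node being upgraded (one node at a time):
[root at lonbaknode3 ~]# systemctl stop glusterd
[root at lonbaknode3 ~]# killall glusterfsd glusterfs

# Pull the newer packages from the configured centos-release-gluster5 repository:
[root at lonbaknode3 ~]# yum clean metadata
[root at lonbaknode3 ~]# yum update glusterfs\*

# Restart the daemon and confirm the running version:
[root at lonbaknode3 ~]# systemctl start glusterd
[root at lonbaknode3 ~]# gluster --version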