Glomski, Patrick
2016-Jul-08 13:29 UTC
[Gluster-users] One client can effectively hang entire gluster array
Hello, users and devs.

TL;DR: One gluster client can essentially cause a denial of service / availability loss for the entire gluster array. There's no way to stop it and almost no way to find the bad client. Probably all versions (at least 3.6 and 3.7) are affected.

We have two large replicate gluster arrays (3.6.6 and 3.7.11) that are used in a high-performance computing environment. Two file-access patterns cause severe issues with glusterfs: some of our scientific codes write hundreds of files (~400-500) simultaneously (one file or more per processor core, so lots of small or large writes), and others read thousands of files (2000-3000) simultaneously to grab metadata from each file (lots of small reads).

In either of these situations, one glusterfsd process on whatever peer the client is currently talking to will skyrocket to *nproc* cpu usage (800%, 1600%) and the storage cluster is essentially useless; all other clients will eventually try to read or write data to the overloaded peer and, when that happens, their connections will hang. Heals between peers hang because the load on the peer is around 1.5x the number of cores or more. This occurs in either gluster 3.6 or 3.7, is very repeatable, and happens much too frequently.

Even worse, there seems to be no definitive way to diagnose which client is causing the issues. Getting 'volume status <> clients' doesn't help because it reports the total number of bytes read/written by each client: (a) the metadata in question is tiny compared to the multi-gigabyte output files being dealt with, and (b) the byte counts are cumulative, and since the compute nodes are always up with the filesystems mounted, the transfer counts are astronomical. The best workaround I've come up with is to blackhole-route traffic from clients one at a time (effectively pushing their traffic over to the other peer), wait a few minutes for the backlogged traffic to dissipate (if it's going to), see whether the load on glusterfsd drops, and repeat until I find the client causing the issue (a rough sketch of that loop is in the P.S. below). I would *love* any ideas on a better way to find rogue clients.

More importantly, though, there needs to be some mechanism enforced to stop one user from being able to render the entire filesystem unavailable for all other users. In the worst case, I would even prefer a gluster volume option that simply disconnects clients exceeding some threshold of file-open requests. That's WAY more preferable than a complete availability loss reminiscent of a DDoS attack...

Apologies for the essay, and looking forward to any help you can provide.

Thanks,
Patrick
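P.S. The probe loop looks roughly like the sketch below (assuming it runs as root on the overloaded peer and iproute2 is available; the client list and load threshold are placeholders, not real values from our setup):

    #!/usr/bin/env python
    # Sketch of the blackhole-route probe loop described above.
    # Assumes: run as root on the overloaded peer, iproute2 installed,
    # CLIENTS / LOAD_THRESHOLD filled in for your own site.
    import os
    import subprocess
    import time

    CLIENTS = ["10.0.0.11", "10.0.0.12"]   # hypothetical compute-node IPs
    LOAD_THRESHOLD = 8.0                   # "acceptable" 1-minute load average
    SETTLE_SECONDS = 300                   # let backlogged traffic drain

    def blackhole(ip, add=True):
        """Add or remove a blackhole route for one client IP (iproute2)."""
        action = "add" if add else "del"
        subprocess.check_call(["ip", "route", action, "blackhole", ip])

    suspect = None
    for client in CLIENTS:
        print("blackholing %s ..." % client)
        blackhole(client, add=True)
        time.sleep(SETTLE_SECONDS)
        load = os.getloadavg()[0]          # 1-minute load average on this peer
        print("load with %s blackholed: %.1f" % (client, load))
        if load < LOAD_THRESHOLD:
            suspect = client               # load dropped: likely the rogue client
            break
        blackhole(client, add=False)       # not the culprit; restore its route

    if suspect:
        print("likely rogue client: %s (route left blackholed)" % suspect)
    else:
        print("no single client identified")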
Jeff Darcy
2016-Jul-08 14:32 UTC
[Gluster-users] [Gluster-devel] One client can effectively hang entire gluster array
> In either of these situations, one glusterfsd process on whatever peer the
> client is currently talking to will skyrocket to *nproc* cpu usage (800%,
> 1600%) and the storage cluster is essentially useless; all other clients
> will eventually try to read or write data to the overloaded peer and, when
> that happens, their connection will hang. Heals between peers hang because
> the load on the peer is around 1.5x the number of cores or more. This occurs
> in either gluster 3.6 or 3.7, is very repeatable, and happens much too
> frequently.

I have some good news and some bad news. The good news is that features to address this are already planned for the 4.0 release. Primarily I'm referring to QoS enhancements, some parts of which were already implemented for the bitrot daemon. I'm still working out the exact requirements for this as a general facility, though. You can help! :) Also, some of the work on "brick multiplexing" (multiple bricks within one glusterfsd process) should help to prevent the thrashing that causes a complete freeze-up.

Now for the bad news. Did I mention that these are 4.0 features? 4.0 is not near term, and not getting any nearer as other features and releases keep "jumping the queue" to absorb all of the resources we need for 4.0 to happen. Not that I'm bitter or anything. ;)

To address your more immediate concerns, I think we need to consider more modest changes that can be completed in more modest time. For example:

* The load should *never* get to 1.5x the number of cores. Perhaps we could tweak the thread-scaling code in io-threads and epoll to check system load and not scale up (or even scale down) if system load is already high.

* We might be able to tweak io-threads (which already runs on the bricks and already has a global queue) to schedule requests in a fairer way across clients. Right now it executes them in the same order that they were read from the network. That tends to be a bit "unfair" and that should be fixed in the network code, but that's a much harder task. (A toy illustration of the per-client fairness idea is sketched below.)

These are only weak approximations of what we really should be doing, and will be doing in the long term, but (without making any promises) they might be sufficient and achievable in the near term. Thoughts?
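To make that second point concrete, here is a toy illustration in Python (not GlusterFS code; every name here is invented) of the per-client fairness idea: keep a queue per client and drain the queues round-robin instead of in strict network-arrival order, so one flooding client can't starve everyone else:

    from collections import OrderedDict, deque

    class FairRequestQueue:
        """Toy per-client round-robin queue (illustration only)."""

        def __init__(self):
            self._queues = OrderedDict()      # client_id -> deque of requests

        def enqueue(self, client_id, request):
            self._queues.setdefault(client_id, deque()).append(request)

        def dequeue(self):
            """Pop one request from the front client, then rotate it to the back."""
            if not self._queues:
                return None
            client_id, q = next(iter(self._queues.items()))
            request = q.popleft()
            del self._queues[client_id]       # rotate this client to the back...
            if q:
                self._queues[client_id] = q   # ...but only if it still has work queued
            return request

    # Client "a" floods 1000 requests while "b" sends two; with round-robin
    # draining, "b" is served after at most one request from "a".
    fq = FairRequestQueue()
    for i in range(1000):
        fq.enqueue("a", ("a", i))
    fq.enqueue("b", ("b", 0))
    fq.enqueue("b", ("b", 1))
    print([fq.dequeue() for _ in range(4)])   # [('a', 0), ('b', 0), ('a', 1), ('b', 1)]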
Steve Dainard
2016-Aug-19 19:22 UTC
[Gluster-users] One client can effectively hang entire gluster array
As a potential solution on the compute-node side, can you have users copy the relevant data from the gluster volume to a local disk (i.e. $TMPDIR), operate on that disk, write output files to that disk, and then write the results back to persistent storage once the job is complete? (A rough sketch of that staging pattern follows the quoted message below.) There are lots of factors to consider, but this is how we operate in a small compute environment while trying to avoid overloading the gluster storage nodes.

On Fri, Jul 8, 2016 at 6:29 AM, Glomski, Patrick <patrick.glomski at corvidtec.com> wrote:

> Hello, users and devs.
>
> TL;DR: One gluster client can essentially cause a denial of service /
> availability loss for the entire gluster array. There's no way to stop it
> and almost no way to find the bad client. Probably all versions (at least
> 3.6 and 3.7) are affected.
>
> [...]
>
> Apologies for the essay, and looking forward to any help you can provide.
>
> Thanks,
> Patrick
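For what it's worth, the staging pattern looks roughly like this (paths, the solver command, and the directory layout are purely illustrative; assumes $TMPDIR points at node-local scratch and the output directory doesn't already exist):

    # Rough staging sketch: copy inputs from the gluster mount to node-local
    # scratch, run the job against the local copy, then copy results back once.
    import os
    import shutil
    import subprocess

    GLUSTER_INPUT  = "/gluster/project/case42/input"    # hypothetical paths
    GLUSTER_OUTPUT = "/gluster/project/case42/output"
    SCRATCH = os.path.join(os.environ.get("TMPDIR", "/tmp"), "case42")

    # 1. Stage in: one bulk copy instead of thousands of small reads on gluster.
    local_input = os.path.join(SCRATCH, "input")
    shutil.copytree(GLUSTER_INPUT, local_input)

    # 2. Compute against local disk only ("my_solver" stands in for the real job).
    local_output = os.path.join(SCRATCH, "output")
    os.makedirs(local_output)
    subprocess.check_call(["my_solver", "--in", local_input, "--out", local_output])

    # 3. Stage out: one bulk copy back to persistent storage, then clean up.
    shutil.copytree(local_output, GLUSTER_OUTPUT)
    shutil.rmtree(SCRATCH)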