On 07/30/2016 10:53 PM, Jay Berkenbilt wrote:
> We're using glusterfs in Amazon EC2 and observing certain behavior involving EBS volumes. The basic situation is that, in some cases, clients can write data to the file system at a rate such that the gluster daemon on one or more of the nodes may block in disk wait for longer than 42 seconds, causing gluster to decide that the brick is down. In fact, it's not down, it's just slow. I believe it is possible, by looking at certain system data on the node with the drive, to tell the difference between being down and working through a backlog.
>
> We are attempting a two-pronged approach to solving this problem:
>
> 1. We would like to figure out how to tune the system, by adjusting kernel parameters and/or glusterd, to try to avoid getting the system into the state of having so much data to flush out to disk that it blocks in disk wait for such a long time.
> 2. We would like to see if we can make gluster more intelligent about responding to the pings so that the client side still gets a response when the remote side is just behind and not down. Though I do understand that, in some high-performance environments, one may want to consider a disk that's not keeping up to have failed, so this may have to be a tunable parameter.
>
> We have a small team that has been working on this problem for a couple of weeks. I just joined the team on Friday. I am new to gluster, but I am not at all new to low-level system programming, Linux administration, etc. I'm very much open to the possibility of digging into the gluster code and supplying patches

Welcome to Gluster. It is great to see a lot of ideas within days :).

> if we can find a way to adjust the behavior of gluster to make it behave better under these conditions.
>
> So, here are my questions:
>
> * Does anyone have experience with this type of issue who can offer any suggestions on kernel parameters or gluster configurations we could play with? We have several kernel parameters in mind and are starting to measure their effect.
> * Does anyone have any background on how we might be able to tell that the system is getting itself into this state? Again, we have some ideas on this already, mostly using sysstat to monitor things, though ultimately, if we find a reliable way to do it, we'd probably code it directly by reading the relevant entries in /proc from our own code. I don't have the details with me right now.
> * Can someone provide any pointers to where in the gluster code the ping logic is handled and/or how one might go about making it a little smarter?

One of the users had a similar problem where ping packets were queued on the waiting list because of heavy traffic. I have a patch that tries to solve the issue: http://review.gluster.org/#/c/11935/ . It is under review and might need some more work, but I think it is worth trying. If you're interested, you can try it out and let me know whether it solves the issue. What the patch does is treat PING packets as the highest-priority packets and add them to the beginning of the ioq list (the list of packets waiting to be sent over the wire).
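To make the queueing idea concrete, here is a rough sketch; the names below are invented for illustration, and this is not the actual patch, which works on gluster's own ioq structures in the socket transport:

    /* Illustrative sketch only -- invented names, not the real patch code.
     * The point is just the queueing policy: ordinary packets keep FIFO
     * order at the tail of the outgoing queue, while PING packets are
     * inserted at the head so they reach the wire even when a large write
     * backlog is already queued ahead of them. */
    #include <stdbool.h>
    #include <sys/queue.h>                 /* TAILQ_* list macros */

    typedef struct pkt {
        bool is_ping;                      /* keepalive/ping vs. normal traffic */
        /* ... payload, iovecs, etc. ... */
        TAILQ_ENTRY(pkt) link;
    } pkt_t;

    TAILQ_HEAD(pkt_queue, pkt);            /* initialize with TAILQ_INIT() */

    /* Enqueue a packet on a connection's outgoing queue. */
    static void enqueue_packet(struct pkt_queue *outq, pkt_t *pkt)
    {
        if (pkt->is_ping)
            TAILQ_INSERT_HEAD(outq, pkt, link);   /* jump ahead of the backlog */
        else
            TAILQ_INSERT_TAIL(outq, pkt, link);   /* normal FIFO order */
    }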
I might have missed some important points from the long mail ;). I'm sorry, I was too lazy to read it completely :).

Regards
Rafi KC

> * Does my description of what we're dealing with suggest that we're just missing something obvious? I jokingly asked the team whether they had remembered to run glusterd with the --make-it-fast flag, but sometimes there are solutions almost like that that we just overlook.
>
> For what it's worth, we're running gluster 3.8 on CentOS 7 in EC2. We see the problem most strongly when using general purpose (gp2) EBS volumes on higher-performance but non-EBS-optimized instances, where it's pretty easy to overload the disk with traffic over the network. We can mostly mitigate this by using provisioned I/O volumes or EBS optimization, or by running on slower instances where the disk outperforms what we can throw at it over the network. Yet at our scale, switching to EBS optimization would cost hundreds of thousands of dollars a year, and running slower instances has obvious drawbacks. In the absence of a "real" solution, we will probably end up trying to modify our software to throttle writes to disk, but having to modify our software to keep from flooding the file system seems like a really sad thing to have to do.
>
> Thanks in advance for any pointers!
>
> --Jay
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
So we managed to work around the behavior by setting:

    sysctl -w vm.dirty_bytes=50000000
    sysctl -w vm.dirty_background_bytes=25000000

In our environment, with our specific load testing, this prevents the disk flush from taking longer than gluster's timeout and avoids the whole problem of gluster timing out. We haven't finished our performance testing, but initial results suggest that it is no worse than the performance we had with our previous home-grown solution. In that solution, we had a fuse layer that called fsync() on every megabyte written as soon as there were 10 megabytes' worth of requests in the queue, which effectively emulated in user code what these kernel parameters do, but with even smaller numbers.
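For context, the throttling logic in that fuse layer was roughly along these lines; this is a simplified sketch with made-up names and a single counter, not our actual code, which tracked a real request queue:

    /* Simplified sketch of the old user-space throttle -- not our actual
     * code.  The idea: once roughly 10 MB of writes are outstanding, start
     * forcing them to disk about one megabyte at a time, so the writer
     * absorbs the latency instead of the brick. */
    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    #define QUEUE_THRESHOLD (10u * 1024 * 1024)   /* start throttling here */
    #define FLUSH_CHUNK     (1u * 1024 * 1024)    /* fsync per this much data */

    static uint64_t outstanding = 0;   /* bytes written but not yet fsync'ed */
    static uint64_t since_flush = 0;   /* bytes written since the last fsync */

    /* Called from the fuse write path after each successful write(). */
    static void throttle_after_write(int fd, size_t nbytes)
    {
        outstanding += nbytes;
        since_flush += nbytes;

        if (outstanding < QUEUE_THRESHOLD)
            return;                    /* backlog still small; leave it to the kernel */

        if (since_flush >= FLUSH_CHUNK) {
            fsync(fd);                 /* push the data out before accepting more */
            since_flush = 0;
            outstanding = 0;           /* the fsync drained this fd's backlog */
        }
    }

The kernel parameters above achieve the same effect without our having to sit in the write path ourselves.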
Thanks for the note below about the potential patch. I applied it to 3.8.1 with the fix from the code review comment and have that in my back pocket in case we need it, but we're going to try with just the kernel tuning for now. These parameters are decent for us anyway because, for other reasons based on the nature of our application and certain customer requirements, we want to keep the amount of dirty data really low.

It looks like the code review has been idle for some time. Is there any reason? It looks like a simple and relatively obvious change (not to take anything away from it at all, and I really appreciate the pointer). Is there anything potentially unsafe about it? For example, are there cases where not always appending to the queue could cause damage to data if the test weren't exactly right or weren't doing exactly what it was expected to do? If I were to run our load test against the patch, it wouldn't catch anything like that, because we don't actually look at the content of the data written in our load test. In any case, if the kernel tuning doesn't completely solve the problem for us, I may pull this out and do some more rigorous testing against it. If I do, I can comment on the code change.

For now, unless I post otherwise, we're considering our specific problem to be resolved, though I believe there remains a potential weakness in gluster's ability to report that it is still up when one of the nodes has a slower disk write speed.

--Jay

On 08/01/2016 01:29 AM, Mohammed Rafi K C wrote:
> One of the users had a similar problem where ping packets were queued on the waiting list because of heavy traffic. I have a patch that tries to solve the issue: http://review.gluster.org/#/c/11935/ . It is under review and might need some more work, but I think it is worth trying. If you're interested, you can try it out and let me know whether it solves the issue. What the patch does is treat PING packets as the highest-priority packets and add them to the beginning of the ioq list (the list of packets waiting to be sent over the wire).
On 08/01/2016 01:29 AM, Mohammed Rafi K C wrote:
> One of the users had a similar problem where ping packets were queued on the waiting list because of heavy traffic. I have a patch that tries to solve the issue: http://review.gluster.org/#/c/11935/ . It is under review and might need some more work, but I think it is worth trying.

Would it be possible to rebase this patch against the latest master? I am interested to see whether we still see the pre-commit regression failures.

Thanks!
Vijay