harry mangalam
2013-Dec-13 01:03 UTC
[Gluster-users] gluster fails under heavy array job load
Hi All,

(Gluster Volume Details at bottom)

I've posted some of this previously, but even after various upgrades, attempted fixes, etc., it remains a problem.

Short version: Our gluster fs (~340TB) provides scratch space for a ~5000-core academic compute cluster. Much of our load is streaming IO from genomics work, and that is the load under which we saw this latest failure. Under heavy batch load, especially array jobs where several 64-core nodes may be doing I/O against the 4 servers / 8 bricks, we often get job failures with the following profile:

Client POV:
Here is a sampling of the client logs (/var/log/glusterfs/gl.log) for all compute nodes that indicated interaction with the user's files:
<http://pastie.org/8548781>

Here are some client Info logs that seem fairly serious:
<http://pastie.org/8548785>

The errors that referenced this user were gathered from all the nodes that were running his code (in compute*), aggregated with:

cut -f2,3 -d']' compute* | cut -f1 -dP | sort | uniq -c | sort -gr

and placed here to show the profile of errors that his run generated:
<http://pastie.org/8548796>

So 71 of them were:

W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote operation failed: Transport endpoint is not connected.

etc. We've seen this before and previously discounted it because it seemed to be related to the problem of spurious NFS-related bugs, but now I'm wondering whether it's a real problem. Likewise the 'remote operation failed: Stale file handle.' warnings. There were no Errors logged per se, though some of the W's looked fairly nasty, like the 'dht_layout_dir_mismatch'.

From the server side, however, during the same period, there were:
0 Warnings about this user's files
0 Errors
458 Info lines
of which only 1 line was not a 'cleanup' line like this:
---
10.2.7.11:[2013-12-12 21:22:01.064289] I [server-helpers.c:460:do_fd_cleanup] 0-gl-server: fd cleanup on /path/to/file
---
it was:
---
10.2.7.14:[2013-12-12 21:00:35.209015] I [server-rpc-fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server: 113697332: SETXATTR /bio/tdlong/RNAseqIII/ckpt.1084030 (c9488341-c063-4175-8492-75e2e282f690) ==> trusted.glusterfs.dht
---

We're losing about 10% of these kinds of array jobs because of this, which is just not supportable.

Gluster details:
Servers and clients run gluster 3.4.0-8.el6 over QDR IB (IPoIB), through 2 Mellanox and 1 Voltaire switches, with Mellanox cards, on CentOS 6.4.

$ gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

'gluster volume status gl detail':
<http://pastie.org/8548826>

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
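The aggregation pipeline in the message above, restated with comments for readers who want to reuse it. This is a sketch: it assumes the per-node client logs have been copied into files named compute* in the current directory, and the 'P' delimiter is presumably there to cut off a per-file suffix (e.g. a "Path: ..." tail) so that otherwise identical messages collapse into one bucket.

    # Split each log line on ']' and keep fields 2-3: the severity plus
    # source location, and the message body (field 1 is the timestamp).
    # Truncate each message at the first capital 'P' (presumably where a
    # per-file "Path: ..." suffix begins), then count the distinct
    # message prefixes and rank them, most frequent first.
    cut -f2,3 -d']' compute* | cut -f1 -dP | sort | uniq -c | sort -gr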
Anand Avati
2013-Dec-13 07:46 UTC
[Gluster-users] gluster fails under heavy array job load
Please provide the full client and server logs (in a bug report). The snippets give some hints, but are not very meaningful without the full context/history since mount time (they show after-the-fact symptoms, but not the part which shows why the disconnects happened). Even before looking into the full logs, here are some quick observations:

- write-behind-window-size = 1024MB seems *excessively* high. Please set this to 1MB (the default) and check whether the stability improves.

- I see RDMA is enabled on the volume. Are you mounting clients through RDMA? If so, for the purpose of diagnostics, can you mount through TCP and check whether the stability improves? If you are using RDMA with such a high write-behind-window-size, spurious ping-timeouts are an almost certainty during heavy writes. The RDMA driver has limited flow control, and setting such a high window-size can easily congest all the RDMA buffers, resulting in spurious ping-timeouts and disconnections.

Avati

On Thu, Dec 12, 2013 at 5:03 PM, harry mangalam <harry.mangalam at uci.edu> wrote:
> Short version: Our gluster fs (~340TB) provides scratch space for a
> ~5000core academic compute cluster.
> [...]
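The two diagnostics suggested above map to commands roughly like the following. A sketch only: it uses the volume name 'gl' and brick server 'bs1' from Harry's post, assumes a hypothetical client mount point of /gl, and exact mount syntax may differ by GlusterFS version.

    # On a server: shrink the write-behind window back to the 1MB default.
    gluster volume set gl performance.write-behind-window-size 1MB

    # On a client: remount over TCP instead of RDMA for the test.
    # For a 'tcp,rdma' volume, mounting the plain volume name selects the
    # TCP transport; a '<volname>.rdma' suffix is what selects RDMA.
    umount /gl
    mount -t glusterfs bs1:/gl /gl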
Alex Chekholko
2013-Dec-13 22:00 UTC
[Gluster-users] gluster fails under heavy array job load
Hi Harry,

My best guess is that you overloaded your interconnect. Do you have metrics showing whether (and when) your network was saturated? That would cause Gluster clients to time out. My best guess is that you went into the "E" state of the "USE (Utilization, Saturation, Errors)" spectrum.

IME that is a common pattern for our Lustre/GPFS clients: you get all kinds of weird error states if you manage to saturate your I/O for an extended period of time and fill all of the buffers everywhere.

Regards,
Alex

On 12/12/2013 05:03 PM, harry mangalam wrote:
> Short version: Our gluster fs (~340TB) provides scratch space for a
> ~5000core academic compute cluster.
>
> Much of our load is streaming IO, doing a lot of genomics work, and that
> is the load under which we saw this latest failure.
>

--
Alex Chekholko
chekh at stanford.edu
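One way to gather the saturation metrics Alex asks about, sketched below. Assumptions: the IPoIB interface is named ib0, and the sysstat and infiniband-diags packages are installed; both names are assumptions, not from the thread.

    # Sample IPoIB throughput every 5 seconds during a heavy array job;
    # sustained rxkB/s + txkB/s near the link's ceiling suggests the
    # interconnect was saturated.
    sar -n DEV 5 | grep -w ib0

    # Fabric-level congestion can also show up in the HCA's extended
    # port counters (e.g. PortXmitWait):
    perfquery -x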