harry mangalam
2013-Apr-12 21:51 UTC
[Gluster-users] Glusterfs 3.3 rapidly generating write errors under heavy load.
As I've posted previously <http://goo.gl/DLplt> with increasing frequency, our academic cluster's glusterfs (340TB over 4 nodes, 2 bricks each; details at bottom) is generating unacceptable errors under heavy load, which is the norm for the cluster. We use the SGE scheduler, and it looks like gluster cannot keep up under heavy write load (as is the case with array jobs), or at least the kind of load that we're putting it under. Comments welcome.

The user who has mostly been affected writes this:

[[..it is the same issue that I've been seeing for a few days. I've been able to get access to up to 800 cores in the last week, which enables a high write load. These programs are also attempting to buffer the output by storing to large internal string streams before writing. A different script, which was based only on command-line manipulations of files (gzip, zcat, cut, and paste), had similar issues. I re-wrote those operations to be done in one fell swoop in C++, and it ran through just fine.]]

For example, in the last 5 days it has generated errors (' E ') in these numbers (biostorX is the node; raid[12] are the bricks):

biostor1 -    58 Errors
   raid1 -     2 Errors
   raid2 -    56 Errors
biostor2 - 13532 Errors
   raid1 - 10384 Errors
   raid2 -  3148 Errors
biostor3 -    35 Errors
   raid1 -     6 Errors
   raid2 -    29 Errors
biostor4 -    98 Errors
   raid1 -    27 Errors
   raid2 -    71 Errors
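A minimal sketch of roughly how tallies like these, and the per-location breakdowns below, can be pulled out of the brick logs on each biostorX node (this assumes the default brick log location, /var/log/glusterfs/bricks/*.log, and it counts the whole log rather than only the last 5 days):

  # count ' E ' lines per source location (file:line:function), per brick log
  for log in /var/log/glusterfs/bricks/*.log; do
      echo "== $log =="
      grep ' E ' "$log" \
          | grep -o '\[[a-z._-]*\.c:[0-9]*:[a-z_]*\]' \
          | sort | uniq -c | sort -rn
  done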
===============================================================
On biostor1, the errors were distributed like this (stripping the particulars):

 # errs  file and position
     44  [posix.c:358:posix_setattr]
      8  [posix.c:823:posix_mknod]
      2  [posix.c:1730:posix_create]
      1  [server.c:176:server_submit_reply]
      1  [rpcsvc.c:1080:rpcsvc_submit_generic]
      1  [posix.c:857:posix_mknod]
      1  [posix-helpers.c:685:posix_handle_pair]

Examples:

44 x [2013-04-11 20:43:28.811049] E [posix.c:358:posix_setattr] 0-gl-posix: setattr (lstat) on /raid2/.glusterfs/9b/03/9b036627-864b-403a-8681-e4b1ad1a0da6 failed: No such file or directory
(occurring in clumps; all 44 happened in one minute)

8 x [2013-04-11 21:36:34.665924] E [posix.c:823:posix_mknod] 0-gl-posix: mknod on /raid2/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms_collected/esm.500000.14 failed: File exists
(7 within 2m)

===============================================================
On biostor2, the errors were distributed like this (stripping the particulars):

 # errs  file and position
   7558  [posix.c:1852:posix_open]
   3136  [posix.c:823:posix_mknod]
   2819  [posix.c:223:posix_stat]
      8  [posix.c:183:posix_lookup]
      4  [posix.c:1730:posix_create]
      2  [posix.c:857:posix_mknod]

Examples:

7558 x [2013-04-11 20:30:12.080860] E [posix.c:1852:posix_open] 0-gl-posix: open on /raid1/.glusterfs/ba/03/ba035b25-ac26-451e-a1ec-9fd9262ce9a3: No such file or directory
(all in ~13m)

3136 x [2013-04-11 14:44:49.185916] E [posix.c:823:posix_mknod] 0-gl-posix: mknod on /raid2/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms_collected/esm.500000.3 failed: File exists
(all in the same 13m as above; of these, all but 17 were referencing the SAME SGE array file: /raid2/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms_collected/esm.500000.22)

2819 x [2013-04-11 20:30:16.469462] E [posix.c:223:posix_stat] 0-gl-posix: lstat on /raid1/.glusterfs/2c/54/2c545e08-a523-4502-bc1a-817e0368a04c failed: No such file or directory
(all in the same 13m as above)

===============================================================
On biostor3, the errors were distributed like this (stripping the particulars):

 # errs  file and position
     17  [server-helpers.c:763:server_alloc_frame]
     15  [posix.c:823:posix_mknod]
      3  [posix.c:1730:posix_create]

Examples:

17 x [2013-04-08 14:22:28.835606] E [server-helpers.c:763:server_alloc_frame] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93) [0x327220a5b3] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293) [0x327220a443] (-->/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0xb8) [0x7fc6a9836558]))) 0-server: invalid argument: conn
(in 2 batches, each within 1s)

15 x [2013-04-10 11:30:44.453916] E [posix.c:823:posix_mknod] 0-gl-posix: mknod on /raid1/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms/esm.500000.18 failed: File exists
(9 in ~5m, 6 in ~3m; see also above; these are SGE array jobs, so they're being generated quite fast)
===============================================================
On biostor4, the errors were distributed like this (stripping the particulars):

 # errs  file and position
     50  [server-helpers.c:763:server_alloc_frame]
     26  [posix.c:823:posix_mknod]
      8  [posix.c:857:posix_mknod]
      8  [posix-helpers.c:685:posix_handle_pair]
      2  [server.c:176:server_submit_reply]
      2  [rpcsvc.c:1080:rpcsvc_submit_generic]
      1  [posix.c:865:posix_mknod]
      1  [posix.c:183:posix_lookup]

Examples:

50 x [2013-04-08 13:36:42.286009] E [server-helpers.c:763:server_alloc_frame] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93) [0x39b200a5b3] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293) [0x39b200a443] (-->/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0xb8) [0x7f42e695e558]))) 0-server: invalid argument: conn
(in 2 groups of 3 and 47, each group occurring within 1s)

26 x [2013-04-11 10:00:47.609499] E [posix.c:823:posix_mknod] 0-gl-posix: mknod on /raid1/bio/tdlong/yeast2/data/bam/YEE_0000_00_00_00__.bam failed: File exists
(2 groups of 6 and 15, each occurring within 1s)

===============================================================
Gluster configuration info:

$ gluster volume info gl

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*
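(The options listed under "Options Reconfigured" were presumably applied with the standard gluster volume-set syntax; a minimal sketch of changing or reverting a single one while testing, with illustrative values only and assuming it is acceptable to change them on the live volume:

  # change a single option on the volume
  gluster volume set gl performance.write-behind-window-size 1024MB
  # revert a single option to its default
  gluster volume reset gl performance.quick-read
)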
===============================================================

$ gluster volume status gl detail

Status of volume: gl
------------------------------------------------------------------------------
Brick            : Brick bs2:/raid1
Port             : 24009
Online           : Y
Pid              : 2904
File System      : xfs
Device           : /dev/sdc
Mount Options    : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size       : 256
Disk Space Free  : 28.2TB
Total Disk Space : 43.7TB
Inode Count      : 9374964096
Free Inodes      : 9372045017
------------------------------------------------------------------------------
Brick            : Brick bs2:/raid2
Port             : 24011
Online           : Y
Pid              : 2910
File System      : xfs
Device           : /dev/sdd
Mount Options    : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size       : 256
Disk Space Free  : 27.2TB
Total Disk Space : 40.9TB
Inode Count      : 8789028864
Free Inodes      : 8786101538
------------------------------------------------------------------------------
Brick            : Brick bs3:/raid1
Port             : 24009
Online           : Y
Pid              : 2876
File System      : xfs
Device           : /dev/sdc
Mount Options    : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size       : 256
Disk Space Free  : 28.5TB
Total Disk Space : 43.7TB
Inode Count      : 9374964096
Free Inodes      : 9372035932
------------------------------------------------------------------------------
Brick            : Brick bs3:/raid2
Port             : 24011
Online           : Y
Pid              : 2881
File System      : xfs
Device           : /dev/sdd
Mount Options    : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size       : 256
Disk Space Free  : 25.0TB
Total Disk Space : 40.9TB
Inode Count      : 8789028864
Free Inodes      : 8786099214
------------------------------------------------------------------------------
Brick            : Brick bs4:/raid1
Port             : 24009
Online           : Y
Pid              : 2955
File System      : xfs
Device           : /dev/sdc
Mount Options    : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size       : 256
Disk Space Free  : 28.0TB
Total Disk Space : 43.7TB
Inode Count      : 9374964096
Free Inodes      : 9372034051
------------------------------------------------------------------------------
Brick            : Brick bs4:/raid2
Port             : 24011
Online           : Y
Pid              : 2961
File System      : xfs
Device           : /dev/sdd
Mount Options    : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size       : 256
Disk Space Free  : 24.1TB
Total Disk Space : 40.9TB
Inode Count      : 8789028864
Free Inodes      : 8786101010
------------------------------------------------------------------------------
Brick            : Brick bs1:/raid1
Port             : 24013
Online           : Y
Pid              : 3043
File System      : xfs
Device           : /dev/sdc
Mount Options    : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size       : 256
Disk Space Free  : 29.1TB
Total Disk Space : 43.7TB
Inode Count      : 9374964096
Free Inodes      : 9372036362
------------------------------------------------------------------------------
Brick            : Brick bs1:/raid2
Port             : 24015
Online           : Y
Pid              : 3049
File System      : xfs
Device           : /dev/sdd
Mount Options    : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size       : 256
Disk Space Free  : 25.9TB
Total Disk Space : 40.9TB
Inode Count      : 8789028864
Free Inodes      : 8786101382

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
"A Message From a Dying Veteran" <http://goo.gl/tTHdo>