Harry Mangalam
2012-Jul-23 23:59 UTC
[Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
I have a fairly new gluster fs of 4 nodes, with 2 RAID6 bricks on each node, connected to a cluster via IPoIB on QDR IB. The servers are all SL6.2, running gluster 3.3-1; the clients are running the gluster-released glusterfs-fuse-3.3.0qa42-1 & glusterfs-3.3.0qa42-1.

The volume seems normal:

$ gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*
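(Aside, in case anyone wants to compare setups: the non-default options above were put in place with the standard volume-set syntax, roughly as sketched below; the values are just the ones reported by 'gluster volume info', not a tuning recommendation.)

$ gluster volume set gl performance.io-thread-count 64
$ gluster volume set gl performance.cache-size 268435456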
open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42 lseek(3, 0, SEEK_SET) = 0 getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0 dup2(3, 255) = 255 close(3) = 0 fcntl(255, F_SETFD, FD_CLOEXEC) = 0 fcntl(255, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) fstat(255, {st_mode=S_IFREG|0755, st_size=42, ...}) = 0 lseek(255, 0, SEEK_CUR) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 read(255, "#!/bin/sh\n${0%/*}/bedtools genom"..., 42) = 42 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2ae9318729e0) = 8229 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGINT, {0x436f40, [], SA_RESTORER, 0x3cb64302d0}, {SIG_DFL, [], SA_RESTORER, 0x3cb64302d0}, 8) = 0 wait4(-1, Does this indicate any optional tuning or operational parameters that we should be using? hjm -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
Harry Mangalam
2012-Jul-24 05:19 UTC
[Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
Some more info... I think the problem is the way bedtools is writing its output - it's not getting buffered correctly. Using some more useful strace flags to force strace into the forked child, you can see that the output is being written, just very slowly, due to the awful, horrible, skeezy, skanky, lazy, wanky way that biologists (me included) tend to write code. i.e., after the data is read in and processed, you get gigantic amounts of this kind of output being written to the file:

[pid 17021] 21:56:21 write(1, "U\t137095\t43\n", 12) = 12 <0.000120>
[pid 17021] 21:56:21 write(1, "U\t137096\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137097\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137098\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137099\t38\n", 12) = 12 <0.000116>
[pid 17021] 21:56:21 write(1, "U\t137100\t38\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137101\t38\n", 12) = 12 <0.000117>

i.e. (the file itself):
...
137098  U  137098  40
137099  U  137099  38
137100  U  137100  38
137101  U  137101  38
137102  U  137102  36

It looks like the current gluster config isn't set up to buffer this particular output correctly, so it's being written on a write-by-write basis. As noted below, my gluster performance options are:

> performance.cache-size: 268435456
> performance.io-cache: on
> performance.quick-read: on
> performance.io-thread-count: 64

Is there an option to address this extremely slow write performance? These options (p. 38 of the 'Gluster File System 3.3.0 Administration Guide') sound like they may help, but without knowing what they actually do, I'm hesitant to apply them to what is now a live fs.

performance.flush-behind: If this option is set ON, instructs write-behind translator to perform flush in background, by returning success (or any errors, if any of previous writes were failed) to application even before flush is sent to backend filesystem.

performance.write-behind-window-size: Size of the per-file write-behind buffer.
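If those are safe to flip on a live volume, I'd presumably apply them with something like the sketch below (standard volume-set syntax; the 1MB window size is only an illustrative value, not something I've tested):

$ gluster volume set gl performance.flush-behind on
$ gluster volume set gl performance.write-behind-window-size 1MB

In the meantime, a workaround may be to have these jobs write their output to local scratch and copy the finished file onto the gluster fs afterward, since reading from gluster while writing to another fs performs fine.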
Advice?

hjm

On Mon, Jul 23, 2012 at 4:59 PM, Harry Mangalam <hjmangalam at gmail.com> wrote:

> I have a fairly new gluster fs of 4 nodes, with 2 RAID6 bricks on each
> node, connected to a cluster via IPoIB on QDR IB. The servers are all
> SL6.2, running gluster 3.3-1; the clients are running the
> gluster-released glusterfs-fuse-3.3.0qa42-1 & glusterfs-3.3.0qa42-1.
>
> The volume seems normal:
>
> $ gluster volume info
> Volume Name: gl
> Type: Distribute
> Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
> Status: Started
> Number of Bricks: 8
> Transport-type: tcp,rdma
> Bricks:
> Brick1: bs2:/raid1
> Brick2: bs2:/raid2
> Brick3: bs3:/raid1
> Brick4: bs3:/raid2
> Brick5: bs4:/raid1
> Brick6: bs4:/raid2
> Brick7: bs1:/raid1
> Brick8: bs1:/raid2
> Options Reconfigured:
> performance.cache-size: 268435456
> nfs.disable: on
> performance.io-cache: on
> performance.quick-read: on
> performance.io-thread-count: 64
> auth.allow: 10.2.*.*,10.1.*.*
>
> The logs on both the server and client are remarkable in their lack of
> anything amiss. The server does have the previously reported,
> zillion-times-repeating string of:
>
> I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
>
> which seems to be correlated with turning the NFS server off; this has
> been mentioned before. The gluster volume log, stripped of that line,
> is here:
> <http://pastie.org/4309225>
>
> Individual large-file reads and writes are in the >300MB/s range, which
> is not magnificent but tolerable. However, we've recently detected what
> appears to be a conflict between reading and writing for some
> applications. When some applications are reading and writing to the
> gluster fs, the client /usr/sbin/glusterfs increases its CPU
> consumption to >100% and the IO drops to almost zero.
>
> When the inputs are on the gluster fs and the output is on another fs,
> performance is as good as on a local RAID. This seems to be specific to
> a particular application (bedtools, perhaps some other openmp genomics
> apps - still checking). Other utilities (cp, perl, tar, etc.) that read
> and write to the gluster filesystem seem to be able to push and pull
> fairly large amounts of data to/from it.
>
> The client is running a genomics utility (bedtools) which reads very
> large chunks of data from the gluster fs, then aligns them to a
> reference genome. Stracing the run yields this stanza, after which it
> hangs until I kill it. The user has said that it does complete, but at
> a speed hundreds of times slower (maybe timing out at each step..?)
>
> open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3
> ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY (Inappropriate ioctl for device)
> lseek(3, 0, SEEK_CUR) = 0
> read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42
> lseek(3, 0, SEEK_SET) = 0
> getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
> dup2(3, 255) = 255
> close(3) = 0
> fcntl(255, F_SETFD, FD_CLOEXEC) = 0
> fcntl(255, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
> fstat(255, {st_mode=S_IFREG|0755, st_size=42, ...}) = 0
> lseek(255, 0, SEEK_CUR) = 0
> rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> read(255, "#!/bin/sh\n${0%/*}/bedtools genom"..., 42) = 42
> rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2ae9318729e0) = 8229
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigaction(SIGINT, {0x436f40, [], SA_RESTORER, 0x3cb64302d0}, {SIG_DFL, [], SA_RESTORER, 0x3cb64302d0}, 8) = 0
> wait4(-1,
>
> Does this indicate any optional tuning or operational parameters that
> we should be using?
>
> hjm
>
> --
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
> 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)