Harry Mangalam
2012-Jul-23 23:59 UTC
[Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
I have a fairly new gluster fs of 4 nodes, with 2 RAID6 bricks on each node, connected to a cluster via IPoIB on QDR IB. The servers are all SL6.2, running gluster 3.3-1; the clients are running the gluster-released glusterfs-fuse-3.3.0qa42-1 & glusterfs-3.3.0qa42-1.

The volume seems normal:

$ gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*
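(Aside, in case anyone wants to compare setups: the non-default options above were put in place with the standard volume-set syntax, roughly as sketched below; the values are just the ones reported by 'gluster volume info', not a tuning recommendation.)

$ gluster volume set gl performance.io-thread-count 64
$ gluster volume set gl performance.cache-size 268435456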
open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY (Inappropriate ioctl for device) lseek(3, 0, SEEK_CUR) = 0 read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42 lseek(3, 0, SEEK_SET) = 0 getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0 dup2(3, 255) = 255 close(3) = 0 fcntl(255, F_SETFD, FD_CLOEXEC) = 0 fcntl(255, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) fstat(255, {st_mode=S_IFREG|0755, st_size=42, ...}) = 0 lseek(255, 0, SEEK_CUR) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 read(255, "#!/bin/sh\n${0%/*}/bedtools genom"..., 42) = 42 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2ae9318729e0) = 8229 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 rt_sigaction(SIGINT, {0x436f40, [], SA_RESTORER, 0x3cb64302d0}, {SIG_DFL, [], SA_RESTORER, 0x3cb64302d0}, 8) = 0 wait4(-1, Does this indicate any optional tuning or operational parameters that we should be using? hjm -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
Harry Mangalam
2012-Jul-24 05:19 UTC
[Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
Some more info... I think the problem is the way bedtools is writing its output - it's not getting buffered correctly. Using some more useful strace flags to force strace into the forked child, you can see that the output is being written, just very slowly, due to the awful, horrible, skeezy, skanky, lazy, wanky way that biologists (me included) tend to write code. i.e., after the data is read in and processed, you get gigantic amounts of this kind of output being written to the file:

[pid 17021] 21:56:21 write(1, "U\t137095\t43\n", 12) = 12 <0.000120>
[pid 17021] 21:56:21 write(1, "U\t137096\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137097\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137098\t40\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137099\t38\n", 12) = 12 <0.000116>
[pid 17021] 21:56:21 write(1, "U\t137100\t38\n", 12) = 12 <0.000119>
[pid 17021] 21:56:21 write(1, "U\t137101\t38\n", 12) = 12 <0.000117>

i.e. (the file itself):
...
137098  U  137098  40
137099  U  137099  38
137100  U  137100  38
137101  U  137101  38
137102  U  137102  36

It looks like the current gluster config isn't set up to buffer this particular output correctly, so it's being written on a write-by-write basis. As noted below, my gluster performance options are:

> performance.cache-size: 268435456
> performance.io-cache: on
> performance.quick-read: on
> performance.io-thread-count: 64

Is there an option to address this extremely slow write performance? These options (p. 38 of the 'Gluster File System 3.3.0 Administration Guide') sound like they may help, but without knowing what they actually do, I'm hesitant to apply them to what is now a live fs.

performance.flush-behind: If this option is set ON, instructs write-behind translator to perform flush in background, by returning success (or any errors, if any of previous writes were failed) to application even before flush is sent to backend filesystem.

performance.write-behind-window-size: Size of the per-file write-behind buffer.
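If those are safe to flip on a live volume, I'd presumably apply them with something like the sketch below (standard volume-set syntax; the 1MB window size is only an illustrative value, not something I've tested):

$ gluster volume set gl performance.flush-behind on
$ gluster volume set gl performance.write-behind-window-size 1MB

In the meantime, a workaround may be to have these jobs write their output to local scratch and copy the finished file onto the gluster fs afterward, since reading from gluster while writing to another fs performs fine.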
Advice?

hjm

On Mon, Jul 23, 2012 at 4:59 PM, Harry Mangalam <hjmangalam at gmail.com> wrote:

> I have a fairly new gluster fs of 4 nodes, with 2 RAID6 bricks on each
> node, connected to a cluster via IPoIB on QDR IB. The servers are all
> SL6.2, running gluster 3.3-1; the clients are running the
> gluster-released glusterfs-fuse-3.3.0qa42-1 & glusterfs-3.3.0qa42-1.
>
> The volume seems normal:
>
> $ gluster volume info
> Volume Name: gl
> Type: Distribute
> Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
> Status: Started
> Number of Bricks: 8
> Transport-type: tcp,rdma
> Bricks:
> Brick1: bs2:/raid1
> Brick2: bs2:/raid2
> Brick3: bs3:/raid1
> Brick4: bs3:/raid2
> Brick5: bs4:/raid1
> Brick6: bs4:/raid2
> Brick7: bs1:/raid1
> Brick8: bs1:/raid2
> Options Reconfigured:
> performance.cache-size: 268435456
> nfs.disable: on
> performance.io-cache: on
> performance.quick-read: on
> performance.io-thread-count: 64
> auth.allow: 10.2.*.*,10.1.*.*
>
> The logs on both the server and client are remarkable in their lack of
> anything amiss. The server does have the previously reported,
> zillion-times-repeating string of:
>
> I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
>
> which seems to be correlated with turning the NFS server off; this has
> been mentioned before. The gluster volume log, stripped of that line,
> is here:
> <http://pastie.org/4309225>
>
> Individual large-file reads and writes are in the >300MB/s range, which
> is not magnificent but tolerable. However, we've recently detected what
> appears to be a conflict between reading and writing for some
> applications. When some applications are reading and writing to the
> gluster fs, the client /usr/sbin/glusterfs increases its CPU
> consumption to >100% and the IO drops to almost zero.
>
> When the inputs are on the gluster fs and the output is on another fs,
> performance is as good as on a local RAID. This seems to be specific to
> a particular application (bedtools, perhaps some other openmp genomics
> apps - still checking). Other utilities (cp, perl, tar, etc.) that read
> and write to the gluster filesystem seem to be able to push and pull
> fairly large amounts of data to/from it.
>
> The client is running a genomics utility (bedtools) which reads very
> large chunks of data from the gluster fs, then aligns them to a
> reference genome. Stracing the run yields this stanza, after which it
> hangs until I kill it. The user has said that it does complete, but at
> a speed hundreds of times slower (maybe timing out at each step..?)
>
> open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3
> ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY (Inappropriate ioctl for device)
> lseek(3, 0, SEEK_CUR) = 0
> read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42
> lseek(3, 0, SEEK_SET) = 0
> getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
> dup2(3, 255) = 255
> close(3) = 0
> fcntl(255, F_SETFD, FD_CLOEXEC) = 0
> fcntl(255, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
> fstat(255, {st_mode=S_IFREG|0755, st_size=42, ...}) = 0
> lseek(255, 0, SEEK_CUR) = 0
> rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> read(255, "#!/bin/sh\n${0%/*}/bedtools genom"..., 42) = 42
> rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2ae9318729e0) = 8229
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> rt_sigaction(SIGINT, {0x436f40, [], SA_RESTORER, 0x3cb64302d0}, {SIG_DFL, [], SA_RESTORER, 0x3cb64302d0}, 8) = 0
> wait4(-1,
>
> Does this indicate any optional tuning or operational parameters that
> we should be using?
>
> hjm
>
> --
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
> 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)