Harry Mangalam
2012-Jul-26 01:02 UTC
[Gluster-users] kernel parameters for improving gluster writes on millions of small writes (long)
This is a continuation of my previous posts about improving write perf when trapping millions of small writes to a gluster filesystem. I was able to improve write perf by ~30x by running STDOUT thru gzip to consolidate and reduce the output stream.

Today, another similar problem, having to do with yet another bioinformatics program. (Programs of this kind typically handle the 'short reads' that come out of the majority of sequencing hardware, each read being 30-150 characters, with some metadata, in an ASCII file containing millions of such entries.) Reading them doesn't seem to be a problem (at least on our systems), but writing them is quite awful.

The program is called 'art_illumina', from the Broad Inst's 'ALLPATHS' suite, and it generates an artificial Illumina data set from an input genome; in this case, about 5GB of the type of data described above. Like before, the gluster process goes to >100% and the program itself slows to ~20-30% of a CPU. In this case, the app's output cannot be externally trapped by redirecting thru gzip, since the output flag specifies the base filename for 2 files that are created internally and then written directly. This prevents even setting up a named pipe to trap and process the output.

Since this gluster storage was set up specifically for bioinformatics, this is a repeating problem, and while some of the issues can be dealt with by trapping and converting output, it would be VERY NICE if we could deal with it at the OS level.

The gluster volume is running over IPoIB on QDR IB and looks like this:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

I've tried to increase every caching option that might improve this kind of performance, but it doesn't seem to help. At this point, I'm wondering whether changing the client (or server) kernel parameters will help.
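A note for anyone experimenting along these lines: the usual client-side kernel knobs for write behavior are the VM writeback sysctls. The following is a minimal sketch; the specific values are illustrative assumptions, not settings taken from this thread or validated for gluster, so benchmark before and after:

    # apply at runtime on the client (persist in /etc/sysctl.conf if they help)
    sysctl -w vm.dirty_background_ratio=1    # start background writeback earlier
    sysctl -w vm.dirty_ratio=20              # cap dirty pages before writers block
    sysctl -w vm.dirty_expire_centisecs=500  # consider dirty pages old after ~5 s

On a client with ~500GB of RAM (as below), the default percentage-based thresholds allow tens of gigabytes of dirty pages, so lowering them smooths out flush storms; raising them instead batches more small writes per flush. Either direction is a trade-off that only measurement can settle.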
The client's meminfo is:

cat /proc/meminfo
MemTotal:       529425924 kB
MemFree:        241833188 kB
Buffers:           355248 kB
Cached:         279699444 kB
SwapCached:             0 kB
Active:           2241580 kB
Inactive:       278287248 kB
Active(anon):      190988 kB
Inactive(anon):    287952 kB
Active(file):     2050592 kB
Inactive(file): 277999296 kB
Unevictable:        16856 kB
Mlocked:            16856 kB
SwapTotal:      563198732 kB
SwapFree:       563198732 kB
Dirty:               1656 kB
Writeback:              0 kB
AnonPages:         486876 kB
Mapped:             19808 kB
Shmem:                164 kB
Slab:             1475476 kB
SReclaimable:     1205944 kB
SUnreclaim:        269532 kB
KernelStack:         5928 kB
PageTables:         27312 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:    827911692 kB
Committed_AS:      536852 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      1227732 kB
VmallocChunk:   33888774404 kB
HardwareCorrupted:      0 kB
AnonHugePages:     376832 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
DirectMap4k:       201088 kB
DirectMap2M:     15509504 kB
DirectMap1G:    521142272 kB

and the server's meminfo is:

$ cat /proc/meminfo
MemTotal:        32861400 kB
MemFree:          1232172 kB
Buffers:            29116 kB
Cached:          30017272 kB
SwapCached:            44 kB
Active:          18840852 kB
Inactive:        11772428 kB
Active(anon):      492928 kB
Inactive(anon):     75264 kB
Active(file):    18347924 kB
Inactive(file): 11697164 kB
Unevictable:            0 kB
Mlocked:                0 kB
SwapTotal:       16382900 kB
SwapFree:        16382680 kB
Dirty:                  8 kB
Writeback:              0 kB
AnonPages:         566876 kB
Mapped:             14212 kB
Shmem:               1276 kB
Slab:              429164 kB
SReclaimable:      324752 kB
SUnreclaim:        104412 kB
KernelStack:         3528 kB
PageTables:         16956 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:     32813600 kB
Committed_AS:     3053096 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       340196 kB
VmallocChunk:   34342345980 kB
HardwareCorrupted:      0 kB
AnonHugePages:     200704 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
DirectMap4k:         6656 kB
DirectMap2M:      2072576 kB
DirectMap1G:     31457280 kB

Does this suggest any approach? Is there a doc that suggests optimal kernel parameters for gluster?

I guess the only other option is to use the glusterfs as an NFS mount and use the NFS client's caching..? That will help on a single process but decrease the overall cluster bandwidth considerably.

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
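For concreteness, the NFS fallback mentioned above would look roughly like the sketch below. Note that nfs.disable is currently on for this volume, so gluster's built-in NFS server (NFSv3 over TCP) would have to be re-enabled first; reusing brick host bs1 as the NFS server and /mnt/gl as the mount point are illustrative choices, not settings from this thread:

    gluster volume set gl nfs.disable off
    mount -t nfs -o vers=3,proto=tcp,noatime,actimeo=60 bs1:/gl /mnt/gl

The NFS client's page cache then coalesces the small writes before they cross the wire, which is the caching effect in question; the cost, as noted, is that all of that client's traffic funnels through the single server it mounted.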
Washer, Bryan
2012-Jul-26 13:23 UTC
[Gluster-users] kernel parameters for improving gluster writes on millions of small writes (long)
Harry,

Just a question, but what filesystem are you using under the gluster system? You may need to tune that before you continue trying to tune the output side. I found that by using the xfs filesystem and tuning it for very large files, I was able to improve my performance quite a bit. In that case, though, I was working with a lot of big files, so my tuning would not help you; I just wanted to make sure you had looked at this detail in your setup.

Bryan
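A minimal sketch of the kind of brick-level XFS tuning Bryan describes follows; the device name and option values are illustrative assumptions, not his actual settings (and his were aimed at large files rather than this small-write workload). The larger inode size is the commonly cited recommendation for gluster bricks, so that gluster's extended attributes fit inside the inode:

    mkfs.xfs -i size=512 /dev/sdb1                       # bigger inodes for gluster xattrs
    mount -o noatime,inode64,logbufs=8 /dev/sdb1 /raid1  # cut metadata overhead on the brick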
John Mark Walker
2012-Jul-26 15:07 UTC
[Gluster-users] kernel parameters for improving gluster writes on millions of small writes (long)
Harry,

Have you seen this post? http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/

Be sure to read all the comments as well; Ben England, one of the performance engineers at Red Hat, chimes in there.

-JM
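Guides of this sort commonly cover both VM and block-layer settings on the brick servers; the sketch below shows the general shape, with placeholder device names and illustrative values rather than specific advice from the linked article:

    echo deadline > /sys/block/sdb/queue/scheduler  # deadline I/O scheduler on the brick disks
    blockdev --setra 4096 /dev/sdb                  # read-ahead in 512-byte sectors (~2 MB here)
    sysctl -w vm.swappiness=10                      # keep the brick page cache resident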