Tom Fite
2017-Dec-05 19:14 UTC
[Gluster-users] Slow seek times on stat calls to glusterfs metadata
Hi all,

I have a distributed / replicated pool consisting of 2 boxes, with 3 bricks apiece. Each brick sits on a RAID 6 array of eleven 6 TB disks. I'm running CentOS 7 with XFS and LVM. The 150 TB pool is loaded with about 15 TB of data. Clients are connected via FUSE. I'm using glusterfs 3.12.1.

I've found that running large rsyncs to populate the pool is taking a very long time, specifically with small files (write throughput is fine). I know, I know -- small files on gluster do not perform well, but I'm seeing particularly terrible performance in the range of around 25 to 50 creates per second.

Profiling and testing indicate the main bottleneck is lstat calls on glusterfs metadata. Running an strace against the glusterd PIDs during a migration shows a lot of lstat calls taking a relatively long time to complete:

strace -Tfp 3544 -p 3550 -p 3536 2>&1 >/dev/null | awk '{gsub(/[<>]/,"",$NF)}$NF+.0>0.5' | grep -Ev "futex|epoll|select|nanosleep"

Calls taking > 500 ms:
[pid 3748] <... lstat resumed> 0x7f6db004f220) = -1 ENOENT (No such file or directory) 0.773194
[pid 29234] <... lstat resumed> 0x7f4c500ac220) = -1 ENOENT (No such file or directory) 1.010627
[pid 13083] <... lstat resumed> 0x7f1c3416c220) = -1 ENOENT (No such file or directory) 0.629203

These lstats can be traced back to calls that look similar to this:

[pid 31570] lstat("/data/brick1/gv0/.glusterfs/1a/61/1a616193-ddef-453b-a86d-dea73c7da496", 0x7f1778067220) = -1 ENOENT (No such file or directory) 0.102771
[pid 31568] lstat("/data/brick1/gv0/.glusterfs/7f/0b/7f0bf1d3-b3e9-4009-9692-4e2e55c6c822", 0x7f17780e9220) = -1 ENOENT (No such file or directory) 0.052719
[pid 31564] lstat("/data/brick2/gv0/.glusterfs/b0/49/b049a03c-114a-443c-bdfc-71ee981d8e84", 0x7f296c575220) = -1 ENOENT (No such file or directory) 0.195458

My theory is this: as the gluster pool fills with data, the .glusterfs metadata gets scattered around the disk, causing random IO seek times to increase. Each file create causes a read / seek in the glusterfs metadata folders for a nonexistent file, which takes a long time to look up due to the random nature of the directory hashes. If this is the root of the problem, it isn't a problem with gluster per se, but with LVM, XFS, the RAID configuration, or my drives.

This bug report might be the same issue: https://bugzilla.redhat.com/show_bug.cgi?id=1200457

I wanted to check with the group to see if anybody else has run into this before, and whether there are suggestions that might help. Specifically --

1. Would adding an SSD hot tier to my pool help here? Is the glusterfs metadata cached in the hot tier, or does hot tiering only cache frequently accessed files in the pool?

2. I have had some success with forcing the glusterfs dirents into the system cache by running a find in the .glusterfs directory, which enumerates and warms the cache with all dirents, eliminating seeks on disk. However, I'm at the mercy of the OS here, and as soon as the dirents are dropped things get slow again. Does anybody know of a way to keep the metadata in cache? I have 128 GB of RAM to work with, so I should be able to cache aggressively. (A sketch of this cache-warming approach follows the volume info below.)

3. Are giant RAID 6 arrays just not going to perform well here? Would more bricks / smaller array sizes or a different RAID level help?

4. Would adding more servers to the gluster pool help or hurt?

Here's my glusterfs config. I've been trying every optimization tweak I can find, including md-cache, bumping up cache sizes, bumping event threads, etc.:
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: [ID]
Status: Started
Snapshot Count: 13
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: pod-sjc1-gluster1.exavault.com:/data/brick1/gv0
Brick2: pod-sjc1-gluster2.exavault.com:/data/brick1/gv0
Brick3: pod-sjc1-gluster1.exavault.com:/data/brick2/gv0
Brick4: pod-sjc1-gluster2.exavault.com:/data/brick2/gv0
Brick5: pod-sjc1-gluster1.exavault.com:/data/brick3/gv0
Brick6: pod-sjc1-gluster2.exavault.com:/data/brick3/gv0
Options Reconfigured:
performance.cache-refresh-timeout: 60
performance.stat-prefetch: on
server.outstanding-rpc-limit: 1024
cluster.lookup-optimize: on
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
features.barrier: disable
client.event-threads: 16
server.event-threads: 16
performance.cache-size: 4GB
network.inode-lru-limit: 90000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.quick-read: on
performance.io-cache: on
performance.nfs.write-behind-window-size: 512MB
performance.write-behind-window-size: 4MB
performance.nfs.io-threads: on
network.tcp-window-size: 1048576
performance.rda-cache-limit: 32MB
performance.flush-behind: on
server.allow-insecure: on
auto-delete: enable

Thanks
-Tom
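For reference, here is a minimal sketch of the cache-warming approach mentioned in question 2, assuming the brick paths shown in the volume info above; the vfs_cache_pressure value is only an illustrative starting point, not a tested recommendation:

#!/bin/bash
# Walk each brick's .glusterfs tree so its dirents and inodes are pulled
# into the kernel dentry/inode caches, turning the negative lookups during
# file creation into memory hits instead of seeks on spinning disk.
# Brick paths are assumed from the volume info above; adjust as needed.
for brick in /data/brick1/gv0 /data/brick2/gv0 /data/brick3/gv0; do
    find "$brick/.glusterfs" >/dev/null
done

# Bias the kernel toward keeping dentries/inodes cached rather than
# reclaiming them (default is 100; lower values retain more metadata).
# Run as root; 10 is an example value, not a recommendation.
sysctl -w vm.vfs_cache_pressure=10

This only helps while the entries stay resident; under memory pressure the kernel will still evict them, which matches the behavior described in question 2.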
Tom Fite
2018-Jan-12 17:45 UTC
[Gluster-users] Slow seek times on stat calls to glusterfs metadata
To follow up on this: I've added an SSD-backed hot tier to my cluster, and it dramatically improved performance. From observing iostat, it appears that all new files are created on the hot tier and migrated to the cold tier when the demotion daemon runs. Since new files use the hot tier, this avoids the stat() calls on spinning disk, and throughput is much faster for new file creation, especially for small files.

-Tom
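The post doesn't show the exact commands used, but attaching a replicated hot tier on glusterfs 3.12 looks roughly like the sketch below; the SSD brick paths are hypothetical placeholders, not the actual ones used here:

# Attach a 2-way replicated SSD hot tier to the existing volume
# (SSD brick paths below are hypothetical placeholders).
gluster volume tier gv0 attach replica 2 \
    pod-sjc1-gluster1.exavault.com:/data/ssd1/gv0 \
    pod-sjc1-gluster2.exavault.com:/data/ssd1/gv0

# Check tier status once attached.
gluster volume tier gv0 status

With a tier attached, new file creates land on the hot tier first and are demoted to the cold tier later, which is consistent with the iostat observation above.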