Tom Fite
2017-Dec-05 19:14 UTC
[Gluster-users] Slow seek times on stat calls to glusterfs metadata
Hi all,

I have a distributed / replicated pool consisting of 2 boxes, with 3 bricks apiece. Each brick sits on a RAID 6 array of eleven 6 TB disks. I'm running CentOS 7 with XFS and LVM. The 150 TB pool is loaded with about 15 TB of data. Clients are connected via FUSE. I'm using glusterfs 3.12.1.

I've found that running large rsyncs to populate the pool is taking a very long time, specifically with small files (write throughput is fine). I know, I know -- small files on gluster do not perform well, but I'm seeing particularly terrible performance in the range of around 25 to 50 creates per second.

Profiling and testing indicate the main bottleneck is lstat calls on glusterfs metadata. Running an strace against the glusterd PIDs during a migration shows a lot of lstat calls taking a relatively long time to complete:

strace -Tfp 3544 -p 3550 -p 3536 2>&1 >/dev/null | awk '{gsub(/[<>]/,"",$NF)}$NF+.0>0.5' | grep -Ev "futex|epoll|select|nanosleep"

Calls taking > 500 ms:
[pid 3748] <... lstat resumed> 0x7f6db004f220) = -1 ENOENT (No such file or directory) 0.773194
[pid 29234] <... lstat resumed> 0x7f4c500ac220) = -1 ENOENT (No such file or directory) 1.010627
[pid 13083] <... lstat resumed> 0x7f1c3416c220) = -1 ENOENT (No such file or directory) 0.629203

These lstats can be traced back to calls that look similar to this:

[pid 31570] lstat("/data/brick1/gv0/.glusterfs/1a/61/1a616193-ddef-453b-a86d-dea73c7da496", 0x7f1778067220) = -1 ENOENT (No such file or directory) 0.102771
[pid 31568] lstat("/data/brick1/gv0/.glusterfs/7f/0b/7f0bf1d3-b3e9-4009-9692-4e2e55c6c822", 0x7f17780e9220) = -1 ENOENT (No such file or directory) 0.052719
[pid 31564] lstat("/data/brick2/gv0/.glusterfs/b0/49/b049a03c-114a-443c-bdfc-71ee981d8e84", 0x7f296c575220) = -1 ENOENT (No such file or directory) 0.195458

My theory is this: as the gluster pool fills with data, the .glusterfs metadata gets scattered around the disk, causing random IO seek times to increase. Each file create causes a read / seek in the glusterfs metadata folders for a nonexistent file, which takes a long time to look up due to the random nature of the directory hashes. If this is the root of the problem, it isn't a problem with gluster per se, but with LVM, XFS, the RAID configuration, or my drives.

This bug report might be the same issue: https://bugzilla.redhat.com/show_bug.cgi?id=1200457

I wanted to check with the group to see if anybody else has run into this before, and whether there are suggestions that might help. Specifically --

1. Would adding an SSD hot tier to my pool help here? Is the glusterfs metadata cached in the hot tier, or does hot tiering only cache frequently accessed files in the pool?

2. I have had some success with forcing the glusterfs dirents into the system cache by running a find in the .glusterfs directory, which enumerates and warms the cache with all dirents, eliminating seeks on disk. However, I'm at the mercy of the OS here, and as soon as the dirents are dropped things get slow again. Does anybody know of a way to keep the metadata in cache? I have 128 GB of RAM to work with, so I should be able to cache aggressively. (A sketch of this cache-warming approach follows the volume info below.)

3. Are giant RAID 6 arrays just not going to perform well here? Would more bricks / smaller array sizes or a different RAID level help?

4. Would adding more servers to the gluster pool help or hurt?

Here's my glusterfs config. I've been trying every optimization tweak I can find, including md-cache, bumping up cache sizes, bumping event threads, etc.:
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: [ID]
Status: Started
Snapshot Count: 13
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: pod-sjc1-gluster1.exavault.com:/data/brick1/gv0
Brick2: pod-sjc1-gluster2.exavault.com:/data/brick1/gv0
Brick3: pod-sjc1-gluster1.exavault.com:/data/brick2/gv0
Brick4: pod-sjc1-gluster2.exavault.com:/data/brick2/gv0
Brick5: pod-sjc1-gluster1.exavault.com:/data/brick3/gv0
Brick6: pod-sjc1-gluster2.exavault.com:/data/brick3/gv0
Options Reconfigured:
performance.cache-refresh-timeout: 60
performance.stat-prefetch: on
server.outstanding-rpc-limit: 1024
cluster.lookup-optimize: on
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
features.barrier: disable
client.event-threads: 16
server.event-threads: 16
performance.cache-size: 4GB
network.inode-lru-limit: 90000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.quick-read: on
performance.io-cache: on
performance.nfs.write-behind-window-size: 512MB
performance.write-behind-window-size: 4MB
performance.nfs.io-threads: on
network.tcp-window-size: 1048576
performance.rda-cache-limit: 32MB
performance.flush-behind: on
server.allow-insecure: on
auto-delete: enable

Thanks
-Tom
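For reference, here is a minimal sketch of the cache-warming approach mentioned in question 2, assuming the brick paths shown in the volume info above; the vfs_cache_pressure value is only an illustrative starting point, not a tested recommendation:

#!/bin/bash
# Walk each brick's .glusterfs tree so its dirents and inodes are pulled
# into the kernel dentry/inode caches, turning the negative lookups during
# file creation into memory hits instead of seeks on spinning disk.
# Brick paths are assumed from the volume info above; adjust as needed.
for brick in /data/brick1/gv0 /data/brick2/gv0 /data/brick3/gv0; do
    find "$brick/.glusterfs" >/dev/null
done

# Bias the kernel toward keeping dentries/inodes cached rather than
# reclaiming them (default is 100; lower values retain more metadata).
# Run as root; 10 is an example value, not a recommendation.
sysctl -w vm.vfs_cache_pressure=10

This only helps while the entries stay resident; under memory pressure the kernel will still evict them, which matches the behavior described in question 2.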
Tom Fite
2018-Jan-12 17:45 UTC
[Gluster-users] Slow seek times on stat calls to glusterfs metadata
To follow up on this: I've added an SSD-backed hot tier to my cluster, and it dramatically improved performance. From observing iostat, it appears that all new files are created on the hot tier and migrated to the cold tier when the demotion daemon runs. Since new files use the hot tier, this avoids the stat() calls on spinning disk, and throughput is much faster for new file creation, especially for small files.

-Tom
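The post doesn't show the exact commands used, but attaching a replicated hot tier on glusterfs 3.12 looks roughly like the sketch below; the SSD brick paths are hypothetical placeholders, not the actual ones used here:

# Attach a 2-way replicated SSD hot tier to the existing volume
# (SSD brick paths below are hypothetical placeholders).
gluster volume tier gv0 attach replica 2 \
    pod-sjc1-gluster1.exavault.com:/data/ssd1/gv0 \
    pod-sjc1-gluster2.exavault.com:/data/ssd1/gv0

# Check tier status once attached.
gluster volume tier gv0 status

With a tier attached, new file creates land on the hot tier first and are demoted to the cold tier later, which is consistent with the iostat observation above.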