Artem Russakovskii
2018-Feb-27 13:52 UTC
[Gluster-users] Very slow rsync to gluster volume UNLESS `ls` or `find` scan dir on gluster volume first
Any updates on this one?

On Mon, Feb 5, 2018 at 8:18 AM, Tom Fite <tomfite at gmail.com> wrote:

> Hi all,
>
> I have seen this issue as well, on Gluster 3.12.1. (3 bricks per box,
> 2 boxes, distributed-replicate.) My testing shows the same thing --
> running a find on a directory dramatically increases lstat performance.
> To add another clue, the performance degrades again after issuing a call
> to reset the system's cache of dentries and inodes:
>
> # sync; echo 2 > /proc/sys/vm/drop_caches
>
> I think that this shows that it's the system cache that's actually doing
> the heavy lifting here. There are a couple of sysctl tunables that I've
> found help out with this.
>
> See here:
>
> http://docs.gluster.org/en/latest/Administrator%20Guide/Linux%20Kernel%20Tuning/
>
> Contrary to what that doc says, I've found that setting
> vm.vfs_cache_pressure to a low value increases performance by allowing
> more dentries and inodes to be retained in the cache.
>
> # Set the swappiness to avoid swap when possible.
> vm.swappiness = 10
>
> # Set the cache pressure to prefer inode and dentry cache over file cache.
> # This is done to keep as many dentries and inodes in cache as possible,
> # which dramatically improves gluster small file performance.
> vm.vfs_cache_pressure = 25
>
> For comparison, my config is:
>
> Volume Name: gv0
> Type: Tier
> Volume ID: d490a9ec-f9c8-4f10-a7f3-e1b6d3ced196
> Status: Started
> Snapshot Count: 13
> Number of Bricks: 8
> Transport-type: tcp
> Hot Tier :
> Hot Tier Type : Replicate
> Number of Bricks: 1 x 2 = 2
> Brick1: gluster2:/data/hot_tier/gv0
> Brick2: gluster1:/data/hot_tier/gv0
> Cold Tier:
> Cold Tier Type : Distributed-Replicate
> Number of Bricks: 3 x 2 = 6
> Brick3: gluster1:/data/brick1/gv0
> Brick4: gluster2:/data/brick1/gv0
> Brick5: gluster1:/data/brick2/gv0
> Brick6: gluster2:/data/brick2/gv0
> Brick7: gluster1:/data/brick3/gv0
> Brick8: gluster2:/data/brick3/gv0
> Options Reconfigured:
> performance.cache-max-file-size: 128MB
> cluster.readdir-optimize: on
> cluster.watermark-hi: 95
> features.ctr-sql-db-cachesize: 262144
> cluster.read-freq-threshold: 5
> cluster.write-freq-threshold: 2
> features.record-counters: on
> cluster.tier-promote-frequency: 15000
> cluster.tier-pause: off
> cluster.tier-compact: on
> cluster.tier-mode: cache
> features.ctr-enabled: on
> performance.cache-refresh-timeout: 60
> performance.stat-prefetch: on
> server.outstanding-rpc-limit: 2056
> cluster.lookup-optimize: on
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> features.barrier: disable
> client.event-threads: 4
> server.event-threads: 4
> performance.cache-size: 1GB
> network.inode-lru-limit: 90000
> performance.md-cache-timeout: 600
> performance.cache-invalidation: on
> features.cache-invalidation-timeout: 600
> features.cache-invalidation: on
> performance.quick-read: on
> performance.io-cache: on
> performance.nfs.write-behind-window-size: 4MB
> performance.write-behind-window-size: 4MB
> performance.nfs.io-threads: off
> network.tcp-window-size: 1048576
> performance.rda-cache-limit: 64MB
> performance.flush-behind: on
> server.allow-insecure: on
> cluster.tier-demote-frequency: 18000
> cluster.tier-max-files: 1000000
> cluster.tier-max-promote-file-size: 10485760
> cluster.tier-max-mb: 64000
> features.ctr-sql-db-wal-autocheckpoint: 2500
> cluster.tier-hot-compact-frequency: 86400
> cluster.tier-cold-compact-frequency: 86400
> performance.readdir-ahead: off
> cluster.watermark-low: 50
> storage.build-pgfid: on
> performance.rda-request-size: 128KB
> performance.rda-low-wmark: 4KB
> cluster.min-free-disk: 5%
> auto-delete: enable
>
> On Sun, Feb 4, 2018 at 9:44 PM, Amar Tumballi <atumball at redhat.com> wrote:
>
>> Thanks for the report, Artem.
>>
>> Looks like the issue is about the cache warming up. Specifically, I
>> suspect rsync is doing a 'readdir(), stat(), file operations' loop,
>> whereas when a find or ls is issued, we get a 'readdirp()' request,
>> which returns the stat information along with the entries and also
>> makes sure the cache is up to date (at the md-cache layer).
>>
>> Note that this is just a hypothesis off the top of my head; we surely
>> need to analyse and debug more thoroughly for a proper explanation.
>> Someone on my team will look at it soon.
>>
>> Regards,
>> Amar
>>
>> On Mon, Feb 5, 2018 at 7:25 AM, Vlad Kopylov <vladkopy at gmail.com> wrote:
>>
>>> Are you mounting it to the local bricks?
>>>
>>> I'm struggling with the same performance issues. Try the volume
>>> settings from this thread:
>>> http://lists.gluster.org/pipermail/gluster-users/2018-January/033397.html
>>> performance.stat-prefetch: on might be it.
>>>
>>> It seems like once it's in the cache it is fast -- it's those stat
>>> fetches, which seem to come from .glusterfs, that are slow.
>>>
>>> On Sun, Feb 4, 2018 at 3:45 AM, Artem Russakovskii <archon810 at gmail.com>
>>> wrote:
>>> > An update, and a very interesting one!
>>> >
>>> > After I started stracing rsync, all I could see was lstat calls,
>>> > quite slow ones, over and over, which is expected.
>>> >
>>> > For example:
>>> > lstat("uploads/2016/10/nexus2cee_DSC05339_thumb-161x107.jpg",
>>> > {st_mode=S_IFREG|0664, st_size=4043, ...}) = 0
>>> >
>>> > I googled around and found
>>> > https://gist.github.com/nh2/1836415489e2132cf85ed3832105fcc1, which
>>> > describes this exact issue with gluster, rsync and xfs.
>>> >
>>> > Here's the craziest finding so far. If while rsync is running (or
>>> > right before), I run /bin/ls or find on the same gluster dirs, it
>>> > immediately speeds up rsync by a factor of 100 or maybe even 1000.
>>> > It's absolutely insane.
>>> >
>>> > I'm stracing the rsync run, and the slow lstat calls flood in at an
>>> > incredible speed as soon as ls or find runs. Several hundred files
>>> > per minute (excruciatingly slow) becomes thousands or even tens of
>>> > thousands of files a second.
>>> >
>>> > What do you make of this?
>>
>> --
>> Amar Tumballi (amarts)
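For reference, the kernel tunables Tom recommends above can be tried at runtime and then persisted with a sysctl drop-in. This is only a minimal sketch: the values are the ones from his mail, the drop-in file name is an arbitrary example, and the right numbers depend on how much RAM you can spare for the dentry/inode cache.

  # Apply at runtime to test the effect on lstat-heavy workloads.
  sysctl -w vm.swappiness=10
  sysctl -w vm.vfs_cache_pressure=25

  # Persist across reboots (file name is just an example).
  cat > /etc/sysctl.d/90-gluster-tuning.conf <<'EOF'
  # Avoid swapping when possible.
  vm.swappiness = 10
  # Prefer keeping dentries and inodes cached over page cache; this is
  # what keeps repeated lstat() calls on the gluster mount fast.
  vm.vfs_cache_pressure = 25
  EOF
  sysctl --system

  # To reproduce the regression Tom describes, drop the dentry/inode caches:
  sync; echo 2 > /proc/sys/vm/drop_caches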
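Until the root cause is pinned down, the observations in this thread suggest an interim workaround: warm the caches with an ls/find crawl of the target directory before (or while) running rsync, since those commands issue readdirp() and bring the stat data into cache. A rough sketch, assuming the volume is FUSE-mounted at /mnt/gluster; all paths and rsync arguments are placeholders.

  # Crawl the tree on the gluster mount first; ls/find trigger readdirp(),
  # which returns stat data along with the entries and warms the caches.
  find /mnt/gluster/uploads -mindepth 1 > /dev/null 2>&1

  # Then run the transfer; the lstat() calls rsync makes should now be
  # answered from cache instead of going to the bricks one by one.
  rsync -a /src/uploads/ /mnt/gluster/uploads/

  # Optional: watch the lstat() rate to confirm, as Artem did with strace.
  # strace -f -T -e trace=lstat rsync -a /src/uploads/ /mnt/gluster/uploads/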
Ingard Mevåg
2018-Feb-27 14:15 UTC
[Gluster-users] Very slow rsync to gluster volume UNLESS `ls` or `find` scan dir on gluster volume first
We got extremely slow stat calls on our disperse cluster running the latest
3.12, with clients also running 3.12. When we downgraded the clients to
3.10, the slow stat problem went away.

We later found out that by disabling disperse.eager-lock we could run the
3.12 clients without much issue (writes are a little bit slower). There is
an open issue on this here: https://bugzilla.redhat.com/show_bug.cgi?id=1546732
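For anyone on a disperse volume who wants to try the workaround Ingard describes, the option is toggled per volume with the gluster CLI. A sketch, with gv0 standing in for your volume name:

  # Check the current value of the option.
  gluster volume get gv0 disperse.eager-lock

  # Disable eager-lock; per the report above this avoided the slow stat
  # calls with 3.12 clients, at the cost of somewhat slower writes.
  gluster volume set gv0 disperse.eager-lock off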
Artem Russakovskii
2018-Apr-18 04:31 UTC
[Gluster-users] Very slow rsync to gluster volume UNLESS `ls` or `find` scan dir on gluster volume first
Nithya, Amar,

Any movement here? There could be a significant performance gain here that
may also affect other bottlenecks I'm experiencing which make gluster close
to unusable at times.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | +ArtemRussakovskii <https://plus.google.com/+ArtemRussakovskii>
| @ArtemR <http://twitter.com/ArtemR>

On Tue, Feb 27, 2018 at 6:15 AM, Ingard Mevåg <ingard at jotta.no> wrote:

> We got extremely slow stat calls on our disperse cluster running the
> latest 3.12, with clients also running 3.12. When we downgraded the
> clients to 3.10, the slow stat problem went away.
>
> We later found out that by disabling disperse.eager-lock we could run the
> 3.12 clients without much issue (writes are a little bit slower).
> There is an open issue on this here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1546732