Dmitry Antipov
2020-Nov-27 05:53 UTC
[Gluster-users] Poor performance on a server-class system vs. desktop
On 11/26/20 8:14 PM, Gionatan Danti wrote:
> So I think you simply are CPU limited. I remember doing some tests with
> loopback RAM disks and finding that Gluster used 100% CPU (ie: full load
> on an entire core) when doing 4K random writes. Side note: using
> synchronized (ie: fsync) 4k writes, I only get ~600 IOPs even when
> running both bricks on the same machine and backing them with RAM disks
> (in other words, with no network or disk bottleneck).

Thanks, it seems you're right. Running local replica 3 volume on 3x1Gb ramdisks, I'm seeing:

top - 08:44:35 up 1 day, 11:51,  1 user,  load average: 2.34, 1.94, 1.00
Tasks: 237 total,   2 running, 235 sleeping,   0 stopped,   0 zombie
%Cpu(s): 38.7 us, 29.4 sy,  0.0 ni, 23.6 id,  0.0 wa,  0.4 hi,  7.9 si,  0.0 st
MiB Mem :  15889.8 total,   1085.7 free,   1986.3 used,  12817.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  12307.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  63651 root      20   0  664124  41676   9600 R 166.7   0.3   0:24.20 fio
  63282 root      20   0 1235336  21484   8768 S 120.4   0.1   2:43.73 glusterfsd
  63298 root      20   0 1235368  20512   8856 S 120.0   0.1   2:42.43 glusterfsd
  63314 root      20   0 1236392  21396   8684 S 119.8   0.1   2:41.94 glusterfsd

So, 32-core server-class system with a lot of RAM can't perform much faster for an individual I/O client - it just scales better if there are a lot of clients, right?

Dmitry
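To put the top numbers in perspective, here is a small sketch that adds up the %CPU column per command. It assumes top's default accounting, where 100% means one fully busy core; the function name `cpu_by_command` is made up for this illustration, and the rows are copied from the output above.

```python
# Process rows copied from the top output in this message.
TOP_ROWS = """\
63651 root 20 0 664124 41676 9600 R 166.7 0.3 0:24.20 fio
63282 root 20 0 1235336 21484 8768 S 120.4 0.1 2:43.73 glusterfsd
63298 root 20 0 1235368 20512 8856 S 120.0 0.1 2:42.43 glusterfsd
63314 root 20 0 1236392 21396 8684 S 119.8 0.1 2:41.94 glusterfsd
"""


def cpu_by_command(rows):
    """Sum the %CPU field (9th column) for each COMMAND (12th column)."""
    totals = {}
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or truncated lines
        command, cpu = fields[11], float(fields[8])
        totals[command] = totals.get(command, 0.0) + cpu
    return totals


if __name__ == "__main__":
    totals = cpu_by_command(TOP_ROWS)
    for command, cpu in totals.items():
        # Assumption: 100% in top corresponds to one fully loaded core.
        print(f"{command:<12}{cpu:7.1f}%  ~ {cpu / 100:.2f} cores")
    print(f"{'total':<12}{sum(totals.values()):7.1f}%")
```

Read this way, one fio client plus three bricks keep a little over five cores busy, so extra cores beyond that have little to contribute to this particular test.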
Gionatan Danti
2020-Nov-27 08:33 UTC
[Gluster-users] Poor performance on a server-class system vs. desktop
On 2020-11-27 06:53, Dmitry Antipov wrote:
> Thanks, it seems you're right. Running local replica 3 volume on 3x1Gb
> ramdisks, I'm seeing:
> [...]
> So, 32-core server-class system with a lot of RAM can't perform much
> faster for an individual I/O client - it just scales better if there are
> a lot of clients, right?

Yes, it should scale with additional clients and bricks.

As a side note, this high-CPU, (relatively) low-performance result was the reason I abandoned the idea of using a 3-way Gluster as the backing store for a hyperconverged KVM setup (with VMs running on the same Gluster hosts): while adequate for "normal" VMs, it would not fit the bill for high-performance guests. Increasing the number of bricks/clients would ameliorate the situation, but then we are suddenly in "rack full of Gluster servers" territory (which is not compatible with my customers' requests).

If anyone has some suggestions, I am all ears!

Regards.

-- 
Danti Gionatan
Technical Support
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8
Amar Tumballi
2020-Nov-27 08:40 UTC
[Gluster-users] Poor performance on a server-class system vs. desktop
Top posting, as my observations are general and don't address anything specific to the problem at hand; they also cover our ideas for improving things.

Thanks Dmitry for a good thread :-)

I will try to break this into a long answer, but will give the short answer to the question first.

Does a single-threaded user app benefit much from more RAM/CPU?
- *NO.*

So, how is distributed storage performance measured?
- By running as many threads (and different client mounts) as possible to saturate the network on the servers.

Let's take a longer look at performance.

First of all, when we talk about the performance of local storage vs. network storage vs. distributed storage, multiple things need to be considered:

Local storage (let's say NVMe/SSD):
  User App -> Kernel (ie, a syscall) -> Access hard drive
  (This is one way; the call returns along the same path.)

Network storage (say NFS):
  User App -> kernel (nfs client through syscall) -> network call -> Server process (nfsd) -> kernel (syscall on the storage machine) -> Access hard drive
  (The reverse path also needs to be traversed to complete the call.)

Distributed storage (say GlusterFS):
  User App -> Kernel (syscall to fuse) -> glusterfs client (callback from fuse) -> network call -> glusterfsd -> kernel (syscall) -> access to hard drive
  (The reverse path is traversed to complete the call.)

Historically, disk and network were the slowest parts here, so the 'kernel' part was almost non-existent as a bottleneck. Gluster did well with aggregation, and gave a linear performance improvement as long as this was true, i.e. as long as your network and disk accounted for a significant % of the bottleneck in your storage stack. The linear scale-out is true even today with NVMe and faster networks, but the % difference between individual local storage performance and glusterfs performance has increased, mainly because of the extra layers a glusterfs call traverses.
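The effect can be sketched with a toy latency model. Every number below is hypothetical (microseconds picked only to show the shape of the argument, not measured on any system), and the names `gluster_path` and `software_share` are invented for this illustration.

```python
def gluster_path(network_us, disk_us):
    """One-way hops of the GlusterFS path, with HYPOTHETICAL costs in microseconds."""
    return {
        "app -> kernel (syscall to fuse)": 5,
        "fuse -> glusterfs client": 20,
        "network": network_us,
        "glusterfsd": 20,
        "glusterfsd -> kernel (syscall)": 5,
        "disk": disk_us,
    }


def software_share(hops, hardware=("network", "disk")):
    """Fraction of the path time spent outside the network and disk hops."""
    total = sum(hops.values())
    hw = sum(cost for name, cost in hops.items() if name in hardware)
    return (total - hw) / total


if __name__ == "__main__":
    # Hypothetical "historical" setup: slow network and a spinning disk.
    old = gluster_path(network_us=200, disk_us=5000)
    # Hypothetical "modern" setup: fast network and NVMe.
    new = gluster_path(network_us=10, disk_us=20)
    for label, hops in (("slow net + disk", old), ("fast net + NVMe", new)):
        print(f"{label}: total {sum(hops.values())} us, "
              f"software share {software_share(hops):.1%}")
```

With these made-up figures the software layers are under 1% of the path in the first case and more than half of it in the second, even though their absolute cost never changed.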
What we are observing now with 100Gbps networks and NVMe drives is that most of the bottlenecks seen in the network layer and disk are going away, and the bottleneck is becoming visible in the way we do certain operations inside glusterfs itself.

Of late, we are noticing that the bottlenecks are in the number of system calls we make as part of a single call made by the user. For example, if you enable all the features of gluster, a single open call would translate into 10s of calls on the disk (stat(), several getxattr() calls, open()). This results in some delay. Also, with a process that utilizes many CPU cores, there is a penalty when synchronization happens (and being a distributed, multi-threaded, multi-client architecture, glusterfs uses multiple locks).

We are working towards a unified caching translator, which would reduce access to disk, meaning fewer system calls made to disk. We are also aware that the network layer is a bottleneck (with XDR formatting and the way we process RPC packets). But taking up network layer optimizations (and also using RDMA effectively) is a larger task. We are looking for volunteers to pick up this network enhancement task, which would benefit a lot.

Now, coming back to the subject: with more CPUs, the same test shows a smaller performance gain because your locks account for a larger % of the bottleneck than on your laptop. Can you try running the same test while restricting the number of cores glusterfsd uses to 4?

Regards,
Amar

-- 
https://kadalu.io
Container Storage made easy!
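One possible way to try the 4-core restriction, sketched under the assumption of a Linux host and sufficient privileges (typically root) to change another process's affinity. The helper names `pids_by_name` and `pin_process` are invented here; `os.sched_setaffinity` is the standard Python wrapper for the Linux call, and since it acts on a single thread, the sketch walks every thread of the process.

```python
import os


def pids_by_name(name):
    """Return PIDs whose /proc/<pid>/comm equals `name` (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited while we were scanning
    return pids


def pin_process(pid, cores):
    """Pin every existing thread of `pid` to the given set of cores."""
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            os.sched_setaffinity(int(tid), cores)
        except ProcessLookupError:
            pass  # thread exited between listing and pinning


if __name__ == "__main__":
    # Assumption: simply take the first four cores we are allowed to use.
    cores = set(sorted(os.sched_getaffinity(0))[:4])
    for pid in pids_by_name("glusterfsd"):
        pin_process(pid, cores)
        print(f"pinned glusterfsd pid {pid} to cores {sorted(cores)}")
```

The same thing can be done from a shell with `taskset -a -cp 0-3 <pid>` for each brick process. Either way, the bricks have to be re-pinned after a restart, and the fio run should be repeated with identical parameters so the two results are comparable.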