thr3ads.net - Gluster users - [Gluster-users] Poor performance on a server-class system vs. desktop [Nov 2020]

Dmitry Antipov

2020-Nov-27 05:53 UTC

[Gluster-users] Poor performance on a server-class system vs. desktop

On 11/26/20 8:14 PM, Gionatan Danti wrote:
> So I think you simply are CPU limited. I remember doing some tests with
> loopback RAM disks and finding that Gluster used 100% CPU (ie: full load on
> an entire core) when doing 4K random writes. Side note: using synchronized
> (ie: fsync) 4k writes, I only get ~600 IOPs even when running both bricks
> on the same machine and backing them with RAM disks (in other words, with
> no network or disk bottleneck).
Thanks, it seems you're right. Running a local replica 3 volume on 3x1Gb
ramdisks, I'm seeing:

top - 08:44:35 up 1 day, 11:51,  1 user,  load average: 2.34, 1.94, 1.00
Tasks: 237 total,   2 running, 235 sleeping,   0 stopped,   0 zombie
%Cpu(s): 38.7 us, 29.4 sy,  0.0 ni, 23.6 id,  0.0 wa,  0.4 hi,  7.9 si,  0.0 st
MiB Mem :  15889.8 total,   1085.7 free,   1986.3 used,  12817.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  12307.3 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
63651 root      20   0  664124  41676   9600 R 166.7   0.3   0:24.20 fio
63282 root      20   0 1235336  21484   8768 S 120.4   0.1   2:43.73 glusterfsd
63298 root      20   0 1235368  20512   8856 S 120.0   0.1   2:42.43 glusterfsd
63314 root      20   0 1236392  21396   8684 S 119.8   0.1   2:41.94 glusterfsd

So, a 32-core server-class system with a lot of RAM can't perform much
faster for an individual I/O client - it just scales better if there are
a lot of clients, right?

Dmitry
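
For reference, the synchronized 4K write test quoted above can be approximated without fio. The sketch below is only an illustration, not the benchmark either poster ran: it issues random 4 KiB writes to a file, calls fsync after each one, and reports IOPS. The function name `measure_sync_write_iops`, the file size, the write count and the target path are all assumptions; pointing the path at a Gluster mount instead of a temporary directory would allow a rough comparison with local storage.

```python
import os
import random
import tempfile
import time

BLOCK_SIZE = 4096  # 4 KiB, matching the 4K writes discussed above


def measure_sync_write_iops(path, file_size=4 * 1024 * 1024, writes=200):
    """Issue `writes` random 4 KiB writes, fsync after each, return IOPS."""
    block = os.urandom(BLOCK_SIZE)
    slots = file_size // BLOCK_SIZE
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, file_size)  # size the file so random offsets are valid
        start = time.perf_counter()
        for _ in range(writes):
            offset = random.randrange(slots) * BLOCK_SIZE
            os.pwrite(fd, block, offset)
            os.fsync(fd)  # flush this write before issuing the next one
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return writes / elapsed


if __name__ == "__main__":
    # Hypothetical target: replace the directory with a Gluster mount point.
    with tempfile.TemporaryDirectory() as tmp:
        iops = measure_sync_write_iops(os.path.join(tmp, "testfile"))
        print(f"{iops:.0f} synchronous 4K write IOPS")
```

Because every write waits for its fsync, the result is bounded by per-operation latency rather than bandwidth, which is why adding cores or RAM does little for a single client.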

Gionatan Danti

2020-Nov-27 08:33 UTC

[Gluster-users] Poor performance on a server-class system vs. desktop

On 2020-11-27 06:53, Dmitry Antipov wrote:
> Thanks, it seems you're right. Running local replica 3 volume on 3x1Gb
> ramdisks, I'm seeing:
>
> top - 08:44:35 up 1 day, 11:51,  1 user,  load average: 2.34, 1.94, 1.00
> Tasks: 237 total,   2 running, 235 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 38.7 us, 29.4 sy,  0.0 ni, 23.6 id,  0.0 wa,  0.4 hi,  7.9 si,  0.0 st
> MiB Mem :  15889.8 total,   1085.7 free,   1986.3 used,  12817.8 buff/cache
> MiB Swap:      0.0 total,      0.0 free,      0.0 used.  12307.3 avail Mem
>
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 63651 root      20   0  664124  41676   9600 R 166.7   0.3   0:24.20 fio
> 63282 root      20   0 1235336  21484   8768 S 120.4   0.1   2:43.73 glusterfsd
> 63298 root      20   0 1235368  20512   8856 S 120.0   0.1   2:42.43 glusterfsd
> 63314 root      20   0 1236392  21396   8684 S 119.8   0.1   2:41.94 glusterfsd
>
> So, 32-core server-class system with a lot of RAM can't perform much
> faster for an individual I/O client - it just scales better if there are
> a lot of clients, right?

Yes, it should scale with additional clients and bricks.

As a side note, this high-CPU, (relatively) low-performance result is the
reason why I abandoned the idea of using a 3-way Gluster replica as the
backing store for a hyperconverged KVM setup (with VMs running on the same
Gluster hosts): while adequate for "normal" VMs, it would not fit the bill
for high-performance guests.

Increasing the number of bricks/clients would ameliorate the situation,
but then we are suddenly in a "rack full of Gluster servers" setup (which
is not compatible with my customers' requests).

If anyone has some suggestions, I am all ears!
Regards.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8

Amar Tumballi

2020-Nov-27 08:40 UTC

[Gluster-users] Poor performance on a server-class system vs. desktop

Top posting, as my observations are general: they don't address anything
specific to the problem at hand, but rather the overall picture and our
ideas for improving it.

Thanks Dmitry for a good thread :-)

I will break this into a long answer, but first a short answer to the
question.

Does a single-threaded user app benefit much from more RAM/CPU? - *NO.*
So, how is distributed storage performance measured? - By running as many
threads (and different client mounts) as possible to saturate the network
on the servers.

Now for a longer look at performance:

First of all, when we compare the performance of local storage vs. network
storage vs. distributed storage, multiple things need to be considered:

Local storage (let's say NVMe/SSD): User App -> kernel (i.e., a syscall) ->
access to the drive. (This is one way; the call returns along the same
path.)
Network storage (say NFS): User App -> kernel (nfs client through syscall)
-> network call -> server process (nfsd) -> kernel (syscall on the storage
machine) -> access to the drive. (The reverse path also needs to be
traversed to complete the call.)
Distributed storage (say GlusterFS): User App -> kernel (syscall to fuse)
-> glusterfs client (callback from fuse) -> network call -> glusterfsd ->
kernel (syscall) -> access to the drive. (Reverse path for completing the
call.)
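
The three paths above can be written down as data to make the extra hops explicit. The sketch below is only a counting model: the stage names are informal labels paraphrased from the description above, and `round_trip_hops` is a hypothetical helper, not anything taken from GlusterFS. It says nothing about how long each hop takes, only how many there are.

```python
# Each path lists the stages a request passes through on its way to the disk,
# as described above. The stage names are informal labels, not real APIs.
PATHS = {
    "local": ["user app", "kernel (syscall)", "disk"],
    "nfs": [
        "user app", "kernel (nfs client)", "network",
        "nfsd", "kernel (syscall)", "disk",
    ],
    "glusterfs": [
        "user app", "kernel (fuse)", "glusterfs client", "network",
        "glusterfsd", "kernel (syscall)", "disk",
    ],
}


def round_trip_hops(stages):
    """Count stage-to-stage transitions for the request plus the reply."""
    one_way = len(stages) - 1
    return 2 * one_way


if __name__ == "__main__":
    for name, stages in PATHS.items():
        print(f"{name:10s} {round_trip_hops(stages):2d} hops round trip")
```

With fast disks and networks, the fixed cost of each hop becomes a larger share of the total latency, which is the point made in the next paragraph.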

Historically, disk and network were the slowest parts here, so the 'kernel'
part was almost non-existent as a bottleneck. Gluster did well with
aggregation and gave a linear performance improvement as long as this was
true, i.e., as long as your network and disk accounted for a significant
share of the time spent in your storage stack. The linear scale-out holds
even today with NVMe and faster networks, but the gap between individual
local storage performance and glusterfs performance has grown, mainly
because of the extra layers glusterfs traverses. What we are observing now
with 100Gbps networks and NVMe drives is that most of the bottlenecks in
the network layer and disk are going away, and the bottleneck becomes
visible in the way we do certain operations inside glusterfs. Of late, we
are noticing that the bottlenecks are in the number of system calls we make
as part of a single call from the user. For example, if you enable all the
features of gluster, a single open call translates into tens of calls on
the disk (stat(), several getxattr() calls, open()). This results in some
delay. Also, for a process which utilizes many CPU cores, there is a
penalty whenever synchronization happens (and being a distributed,
multi-threaded, multi-client architecture, glusterfs uses multiple locks).
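
The amplification described above can be illustrated locally. The following sketch is not GlusterFS code: it compares a bare open/close against an open preceded by a stat() and several getxattr() lookups, the kind of sequence mentioned above. The attribute names and helper names are made up for illustration, and `os.getxattr` is available on Linux only.

```python
import os
import tempfile
import time

# Hypothetical attribute names; real GlusterFS xattrs are not modeled here.
XATTRS = ["user.demo.a", "user.demo.b", "user.demo.c", "user.demo.d"]


def plain_open(path):
    """One open and one close; returns the number of Python-level calls."""
    fd = os.open(path, os.O_RDONLY)
    os.close(fd)
    return 2


def amplified_open(path):
    """stat + several getxattr lookups + open/close; returns the call count."""
    calls = 0
    os.stat(path)
    calls += 1
    for name in XATTRS:
        try:
            os.getxattr(path, name)
        except OSError:
            pass  # missing attribute or unsupported filesystem: still a call
        calls += 1
    fd = os.open(path, os.O_RDONLY)
    os.close(fd)
    return calls + 2


def time_calls(func, path, repeat=2000):
    """Return the average number of seconds per invocation of func(path)."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(path)
    return (time.perf_counter() - start) / repeat


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print("calls:", plain_open(tmp.name), "vs", amplified_open(tmp.name))
        print("plain     %.2f us" % (time_calls(plain_open, tmp.name) * 1e6))
        print("amplified %.2f us" % (time_calls(amplified_open, tmp.name) * 1e6))
```

On a local filesystem the difference is a few microseconds per open; the cost matters once it is multiplied across every brick and every operation.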

We are working towards a unified caching translator, which would reduce
access to disk and therefore the number of system calls made to disk. We
are also aware that the network layer is a bottleneck (with XDR formatting
and the way we process RPC packets). But taking up network layer
optimizations (and also using RDMA effectively) is a larger task. We are
looking for volunteers to pick up this network enhancement task, which
would benefit everyone a lot.

Now, coming back to the subject: with more CPUs, the same test shows a
smaller performance gain because lock contention accounts for a larger
share of the bottleneck than on your laptop. Can you restrict the number
of cores glusterfsd uses to 4 and rerun the test?
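
One way to run the experiment suggested above is to restrict the CPU affinity of the brick process. The sketch below is a Linux-only illustration using Python's `os.sched_setaffinity`; the helper name `pin_process` and the choice of cores are assumptions, and the PID would come from something like the top output earlier in the thread. Affinity is set per thread, so the helper walks /proc/<pid>/task to cover every thread of a multi-threaded process.

```python
import os


def pin_process(pid, cores):
    """Restrict every thread of `pid` to the given CPU cores (Linux only)."""
    cores = set(cores)
    pinned = 0
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            os.sched_setaffinity(int(tid), cores)
            pinned += 1
        except ProcessLookupError:
            pass  # thread exited between listing and pinning
    return pinned


if __name__ == "__main__":
    # Demonstrated on the current process. For the experiment above, pass a
    # glusterfsd PID (as root) and four cores such as {0, 1, 2, 3}.
    available = sorted(os.sched_getaffinity(0))
    pin_process(os.getpid(), available[:4])
    print("now allowed on:", sorted(os.sched_getaffinity(0)))
```

Threads created after pinning inherit the affinity of the thread that spawned them, so pinning once after the brick has started should normally be enough for a short test.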

Regards,
Amar

-- 
--
https://kadalu.io
Container Storage made easy!