Hu Bert
2018-Dec-12 09:41 UTC
[Gluster-users] gluster 4.1.6 brick problems: 2 processes for one brick, performance problems
Hello,

we started with a gluster installation: 3.12.11, 3 servers (gluster11, gluster12, gluster13) and 4 bricks (each hdd == brick, JBOD behind controller) per server: bricksda1, bricksdb1, bricksdc1, bricksdd1; full information: see here: https://pastebin.com/0ndDSstG

In the beginning everything was running fine. In May one hdd (sdd on gluster13) died and got replaced; i replaced the brick and the self-heal started, taking weeks and worsening performance. One week after the heal had finished, another hdd (sdd on gluster12) died -> did the same again; it again took weeks, bad performance etc.

After the replace/heal the performance on most of the bricks was ok, but 2 have bad performance; in short:

gluster11: no hdd change, bricksd(a|b|c) ok, bricksdd takes much longer for requests
gluster12: 1 hdd change, all bricks with normal performance
gluster13: 1 hdd change, bricksd(a|b|c) ok, bricksdd takes much longer for requests

We've checked (thx to Pranith and Xavi) hardware, disk speed, gluster settings etc., but only the 2 bricksdd on gluster11+13 take much longer (>2x) for each request, worsening the overall gluster performance. So something must be wrong, especially with bricksdd1. Does anyone know how to investigate this further?

2nd problem: during all these checks and searches we upgraded glusterfs from 3.12.11 -> 3.12.15 and finally to 4.1.6, but the problems didn't disappear. And some additional problems came up: this week i rebooted (kernel updates) gluster11 and gluster13 (the ones with the "sick" bricksdd1), and for these 2 bricks 2 processes were started, making them unavailable.

root 2118 0.1 0.0 944596 12452 ? Ssl 07:25 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/546621eb24596f4c.socket --xlator-option *replicate*.node-uuid=4fdb11c3-a5af-4e18-af48-182c00b88cc8 --process-name glustershd
root 2197 0.5 0.0 540808 8672 ? Ssl 07:25 0:00 /usr/sbin/glusterfsd -s gluster13 --volfile-id shared.gluster13.gluster-bricksdd1_new-shared -p /var/run/gluster/vols/shared/gluster13-gluster-bricksdd1_new-shared.pid -S /var/run/gluster/23f68b171e2f2c9e.socket --brick-name /gluster/bricksdd1_new/shared -l /var/log/glusterfs/bricks/gluster-bricksdd1_new-shared.log --xlator-option *-posix.glusterd-uuid=4fdb11c3-a5af-4e18-af48-182c00b88cc8 --process-name brick --brick-port 49155 --xlator-option shared-server.listen-port=49155

In the brick log for bricksdd1_new i see:

[2018-12-12 06:20:41.817978] I [rpcsvc.c:2052:rpcsvc_spawn_threads] 0-rpc-service: spawned 1 threads for program 'GlusterFS 3.3'; total count:1
[2018-12-12 06:20:41.818048] I [rpcsvc.c:2052:rpcsvc_spawn_threads] 0-rpc-service: spawned 1 threads for program 'GlusterFS 4.x v1'; total count:1

A simple 'gluster volume start shared force' ended up in having 4 processes for that brick. I had to do the following twice:

- kill the 2 brick processes
- gluster volume start shared force

After the 2nd try there was only 1 brick process left and the heal started. Has anyone seen 2 processes for one brick? I followed the upgrade guide (https://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_4.1/), but is there anything one can do?
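For reference, a rough sketch of the kill/restart cycle described above (volume 'shared', brick bricksdd1_new); the ps/grep lines and the placeholder PIDs are only meant to illustrate how i checked, not an official recovery procedure:

# find the duplicate brick processes for bricksdd1_new
ps aux | grep '[g]lusterfsd.*bricksdd1_new'
# compare with what glusterd itself reports (PID, port, online state)
gluster volume status shared
# stop both brick processes, using the PIDs from the ps output
kill <pid1> <pid2>
# let glusterd respawn the brick
gluster volume start shared force
# verify that only one glusterfsd is left for this brick
ps aux | grep '[g]lusterfsd.*bricksdd1_new'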
3rd problem: i've seen some additional issues: on the mounted volume some clients can't see some directories, even though they are there, while other clients can. Example:

client1: ls /data/repository/shared/public/staticmap/118/
ls: cannot access '/data/repository/shared/public/staticmap/118/408': No such file or directory
238 255 272 289 306 323 340 357 374 391 408 478 [...]

client1: ls /data/repository/shared/public/staticmap/118/408/
ls: cannot access '/data/repository/shared/public/staticmap/118/408/': No such file or directory

client2: ls /data/repository/shared/public/staticmap/118/408/
118408013 118408051 118408260 118408285 118408334 118408399 [...]

mount options: nothing special, from /etc/fstab:

gluster13:/shared /shared glusterfs defaults,_netdev 0 0

By doing a umount/mount the problem disappears:

umount /data/repository/shared ; mount -t glusterfs gluster12:/shared /data/repository/shared

Has anyone had or seen such problems?

Thx
Hubert
Hu Bert
2018-Dec-19 14:14 UTC
[Gluster-users] gluster 4.1.6 brick problems: 2 processes for one brick, performance problems
Hi,

so it seems that no one has seen these problems before:

- doubled brick processes after a reboot (after upgrading 3.12.16 -> 4.1.6)
- directories that can't be seen via e.g. ls

We even see a 'transport endpoint is not connected' error on some clients, e.g. when changing a volume parameter or when a simple operation like ls on a directory with a couple of hundred subdirs takes too long. umount + mount fixes this, but it seems the setup is too messed up to rescue; it looks like we have to find a different/reliable/suitable solution.
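In case anyone wants to check the same thing on their setup, this is roughly how we spot the problem on an affected client; the mount log path is an assumption based on the usual naming (mount point with slashes turned into dashes), and the volume commands are run on one of the gluster servers:

# on the client: look for disconnects in the fuse mount log
# (assumed log path for the mount point /data/repository/shared)
grep -iE 'disconnect|not connected' /var/log/glusterfs/data-repository-shared.log | tail -n 20
# on a server: see which clients are still connected to the bricks
gluster volume status shared clients
# and whether there are pending self-heal entries
gluster volume heal shared info
# so far the only workaround on the client is the umount/mount from my first mail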