Hu Bert
2018-Dec-12 09:41 UTC
[Gluster-users] gluster 4.1.6 brick problems: 2 processes for one brick, performance problems
Hello,

we started with a gluster installation: 3.12.11, 3 servers (gluster11, gluster12, gluster13) and 4 bricks (each hdd == brick, JBOD behind controller) per server: bricksda1, bricksdb1, bricksdc1, bricksdd1; full information: see here: https://pastebin.com/0ndDSstG

In the beginning everything was running fine. In May one hdd (sdd on gluster13) died and got replaced; i replaced the brick and the self-heal started, taking weeks and worsening performance. One week after the heal had finished, another hdd (sdd on gluster12) died -> did the same again; it again took weeks, bad performance etc.

After the replace/heal the performance on most of the bricks was ok, but 2 have bad performance; in short:

gluster11: no hdd change, bricksd(a|b|c) ok, bricksdd takes much longer for requests
gluster12: 1 hdd change, all bricks with normal performance
gluster13: 1 hdd change, bricksd(a|b|c) ok, bricksdd takes much longer for requests

We've checked (thx to Pranith and Xavi) hardware, disk speed, gluster settings etc., but only the 2 bricksdd on gluster11+13 take much longer (>2x) for each request, worsening the overall gluster performance. So something must be wrong, especially with bricksdd1. Does anyone know how to investigate this further?

2nd problem: during all these checks and searches we upgraded glusterfs from 3.12.11 -> 3.12.15 and finally to 4.1.6, but the problems didn't disappear. And some additional problems came up: this week i rebooted (kernel updates) gluster11 and gluster13 (the ones with the "sick" bricksdd1), and for these 2 bricks 2 processes were started, making them unavailable.

root 2118 0.1 0.0 944596 12452 ? Ssl 07:25 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/546621eb24596f4c.socket --xlator-option *replicate*.node-uuid=4fdb11c3-a5af-4e18-af48-182c00b88cc8 --process-name glustershd
root 2197 0.5 0.0 540808 8672 ? Ssl 07:25 0:00 /usr/sbin/glusterfsd -s gluster13 --volfile-id shared.gluster13.gluster-bricksdd1_new-shared -p /var/run/gluster/vols/shared/gluster13-gluster-bricksdd1_new-shared.pid -S /var/run/gluster/23f68b171e2f2c9e.socket --brick-name /gluster/bricksdd1_new/shared -l /var/log/glusterfs/bricks/gluster-bricksdd1_new-shared.log --xlator-option *-posix.glusterd-uuid=4fdb11c3-a5af-4e18-af48-182c00b88cc8 --process-name brick --brick-port 49155 --xlator-option shared-server.listen-port=49155

In the brick log for bricksdd1_new i see:

[2018-12-12 06:20:41.817978] I [rpcsvc.c:2052:rpcsvc_spawn_threads] 0-rpc-service: spawned 1 threads for program 'GlusterFS 3.3'; total count:1
[2018-12-12 06:20:41.818048] I [rpcsvc.c:2052:rpcsvc_spawn_threads] 0-rpc-service: spawned 1 threads for program 'GlusterFS 4.x v1'; total count:1

A simple 'gluster volume start shared force' ended up in having 4 processes for that brick. I had to do the following twice:

- kill the 2 brick processes
- gluster volume start shared force

After the 2nd try there was only 1 brick process left and the heal started. Has anyone seen 2 processes for one brick? I followed the upgrade guide (https://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_4.1/), but is there anything one can do?
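For reference, a rough sketch of the kill/restart cycle described above (volume 'shared', brick bricksdd1_new); the ps/grep lines and the placeholder PIDs are only meant to illustrate how i checked, not an official recovery procedure:

# find the duplicate brick processes for bricksdd1_new
ps aux | grep '[g]lusterfsd.*bricksdd1_new'
# compare with what glusterd itself reports (PID, port, online state)
gluster volume status shared
# stop both brick processes, using the PIDs from the ps output
kill <pid1> <pid2>
# let glusterd respawn the brick
gluster volume start shared force
# verify that only one glusterfsd is left for this brick
ps aux | grep '[g]lusterfsd.*bricksdd1_new'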
3rd problem: i've seen some additional issues: on the mounted volume some clients can't see some directories, even though they are there, while other clients can. Example:

client1: ls /data/repository/shared/public/staticmap/118/
ls: cannot access '/data/repository/shared/public/staticmap/118/408': No such file or directory
238 255 272 289 306 323 340 357 374 391 408 478 [...]

client1: ls /data/repository/shared/public/staticmap/118/408/
ls: cannot access '/data/repository/shared/public/staticmap/118/408/': No such file or directory

client2: ls /data/repository/shared/public/staticmap/118/408/
118408013 118408051 118408260 118408285 118408334 118408399 [...]

mount options: nothing special, from /etc/fstab:

gluster13:/shared /shared glusterfs defaults,_netdev 0 0

By doing a umount/mount the problem disappears:

umount /data/repository/shared ; mount -t glusterfs gluster12:/shared /data/repository/shared

Has anyone had or seen such problems?

Thx
Hubert
Hu Bert
2018-Dec-19 14:14 UTC
[Gluster-users] gluster 4.1.6 brick problems: 2 processes for one brick, performance problems
Hi,

so it seems that no one has seen these problems before:

- doubled brick processes after a reboot (after upgrading 3.12.16 -> 4.1.6)
- directories that can't be seen via e.g. ls

We even see a 'transport endpoint is not connected' error on some clients, e.g. when changing a volume parameter or when a simple operation like ls on a directory with a couple of hundred subdirs takes too long. umount + mount fixes this, but it seems the setup is too messed up to rescue; it looks like we have to find a different/reliable/suitable solution.
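In case anyone wants to check the same thing on their setup, this is roughly how we spot the problem on an affected client; the mount log path is an assumption based on the usual naming (mount point with slashes turned into dashes), and the volume commands are run on one of the gluster servers:

# on the client: look for disconnects in the fuse mount log
# (assumed log path for the mount point /data/repository/shared)
grep -iE 'disconnect|not connected' /var/log/glusterfs/data-repository-shared.log | tail -n 20
# on a server: see which clients are still connected to the bricks
gluster volume status shared clients
# and whether there are pending self-heal entries
gluster volume heal shared info
# so far the only workaround on the client is the umount/mount from my first mail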