Thomas Bätzler
2020-Nov-08 15:29 UTC
[Gluster-users] Weird glusterfs behaviour after add-bricks and fix-layout
Hi,

the other day we decided to expand our gluster storage by adding two bricks, going from a 3x2 to a 4x2 distributed-replicated setup. In order to get to this point we had done rolling upgrades from 3.something to 5.13 to 7.8, all without issues. We ran into a spot of trouble during the fix-layout when both of the new nodes crashed in the space of two days. We rebooted them and the fix-layout process seemed to cope by restarting itself on these nodes.

Now weird things are happening. We noticed that we can't access some files directly anymore. Doing a stat on them returns "file not found". However, if we list the directory containing the files, the file is shown as present and can subsequently be accessed on the client that ran the listing, but not on other clients! Also, if we umount and remount, the file is inaccessible again.

Does anybody have any idea what's going on here? Is there any way to fix the volume without taking it offline for days? We have about 60T of data online and we need that data to be consistent and available.

OS is Debian 10 with glusterfs-server 7.8-3.

Volume configuration:

Volume Name: hotcache
Type: Distributed-Replicate
Volume ID: 4c006efa-6fd6-4809-93b0-28dd33fee2d2
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: hotcache1:/data/glusterfs/drive1/hotcache
Brick2: hotcache2:/data/glusterfs/drive1/hotcache
Brick3: hotcache3:/data/glusterfs/drive1/hotcache
Brick4: hotcache4:/data/glusterfs/drive1/hotcache
Brick5: hotcache5:/data/glusterfs/drive1/hotcache
Brick6: hotcache6:/data/glusterfs/drive1/hotcache
Brick7: hotcache7:/data/glusterfs/drive1/hotcache
Brick8: hotcache8:/data/glusterfs/drive1/hotcache
Options Reconfigured:
performance.readdir-ahead: off
diagnostics.client-log-level: INFO
diagnostics.brick-log-level: INFO
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
server.statedump-path: /var/tmp
diagnostics.brick-sys-log-level: ERROR
storage.linux-aio: on
performance.read-ahead: off
performance.write-behind-window-size: 4MB
performance.cache-max-file-size: 200kb
nfs.disable: on
performance.cache-refresh-timeout: 1
performance.io-cache: on
performance.stat-prefetch: off
performance.quick-read: on
performance.io-thread-count: 16
auth.allow: *
cluster.readdir-optimize: on
performance.flush-behind: off
transport.address-family: inet
cluster.self-heal-daemon: enable

TIA!

Best regards,
Thomas Bätzler
--
BRINGE Informationstechnik GmbH
Zur Seeplatte 12
D-76228 Karlsruhe
Germany

Fon: +49 721 94246-0
Fon: +49 171 5438457
Fax: +49 721 94246-66
Web: http://www.bringe.de/

Geschäftsführer: Dipl.-Ing. (FH) Martin Bringe
Ust.Id: DE812936645, HRB 108943 Mannheim
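A minimal sketch of the symptom described above, plus a brick-side layout check, might look as follows. The mount point /mnt/hotcache and the path some/dir/file are placeholders; the brick path is taken from the volume info above, and trusted.glusterfs.dht / trusted.glusterfs.dht.linkto are the standard DHT xattrs GlusterFS uses for layout and link files.

# On a freshly mounted client (assumed mount point /mnt/hotcache):
stat /mnt/hotcache/some/dir/file      # fails with "No such file or directory"
ls /mnt/hotcache/some/dir             # listing shows the file
stat /mnt/hotcache/some/dir/file      # now succeeds, but only on this client

# On each brick host, dump the DHT xattrs of the affected directory and file:
getfattr -m . -d -e hex /data/glusterfs/drive1/hotcache/some/dir
getfattr -m . -d -e hex /data/glusterfs/drive1/hotcache/some/dir/file
# Compare trusted.glusterfs.dht (hash ranges) on the directory across bricks and
# check any trusted.glusterfs.dht.linkto entries for the file.

An interrupted fix-layout could plausibly leave inconsistent or overlapping trusted.glusterfs.dht ranges on parent directories, which would match files being "found" only after a directory listing.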
Barak Sason Rofman
2020-Nov-09 11:29 UTC
[Gluster-users] Weird glusterfs behaviour after add-bricks and fix-layout
Greetings Thomas,

I'll try to assist in determining the root cause and resolving the issue. The following will help me assist you:

1. Please create an issue on GitHub with all the relevant details; it'll be easier to track there.
2. Please provide all client-side, brick-side and fix-layout logs.

With that information I can begin an initial assessment of the situation.

Thank you,
--
Barak Sason Rofman
Gluster Storage Development
Red Hat Israel <https://www.redhat.com/>
34 Jerusalem rd. Ra'anana, 43501
bsasonro at redhat.com
T: +972-9-7692304  M: +972-52-4326355
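A rough sketch of how those logs could be gathered, assuming default GlusterFS log locations on Debian and the volume/brick paths shown earlier in the thread (the client log file name follows the mount point, so adjust it to the actual mount path):

# On each client: the FUSE mount log lives under /var/log/glusterfs/,
# named after the mount point (e.g. /mnt/hotcache -> mnt-hotcache.log)
tar czf client-logs-$(hostname).tar.gz /var/log/glusterfs/mnt-hotcache.log*

# On each brick host: brick, rebalance/fix-layout and glusterd logs
tar czf server-logs-$(hostname).tar.gz \
    /var/log/glusterfs/bricks/data-glusterfs-drive1-hotcache.log* \
    /var/log/glusterfs/hotcache-rebalance.log* \
    /var/log/glusterfs/glusterd.log*

# Current fix-layout / rebalance state, for reference:
gluster volume rebalance hotcache status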