Filed a bug report. I was not able to reproduce the issue on x86 hardware.
https://bugzilla.redhat.com/show_bug.cgi?id=1811373
On Mon, Mar 2, 2020 at 1:58 AM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> On March 2, 2020 3:29:06 AM GMT+02:00, Fox <foxxz.net at gmail.com> wrote:
> >The brick is mounted. However glusterfsd crashes shortly after startup.
> >This happens on any host that needs to heal a dispersed volume.
> >
> >I spent today doing a clean rebuild of the cluster: a clean install of
> >Ubuntu 18 and Gluster 7.2, create a dispersed volume, then reboot one of
> >the cluster members while the volume is up and online. When that cluster
> >member comes back it cannot heal.
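> >
> >Roughly the sequence, from memory (the disperse/redundancy counts and the
> >brace expansion here are just an illustration, not necessarily the exact
> >command from my terminal log):
> >
> >gluster volume create disp1 disperse 12 redundancy 4 \
> >    gluster{01..12}:/exports/sda/brick1/disp1
> >gluster volume start disp1
> ># mount the volume from a client and write some files, then
> ># reboot one member, e.g. gluster12, while the volume is online
> ># once gluster12 is back up, check heal state:
> >gluster volume heal disp1 info summary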
> >
> >I was able to replicate this behavior with Raspberry Pis running Raspbian
> >and Gluster 5, so it looks like it's not limited to the specific hardware
> >or version of Gluster I'm using, but perhaps the ARM architecture as a
> >whole.
> >
> >Thank you for your help. Aside from not using dispersed volumes I don't
> >think there is much more I can do. Submit a bug report I guess :)
> >
> >
> >
> >
> >
> >On Sun, Mar 1, 2020 at 12:02 PM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> >
> >> On March 1, 2020 6:22:59 PM GMT+02:00, Fox <foxxz.net at gmail.com> wrote:
> >> >Yes the brick was up and running. And I can see files on the brick
> >> >created by connected clients up until the node was rebooted.
> >> >
> >> >This is what the volume status looks like after gluster12 was rebooted.
> >> >Prior to reboot it showed as online and was otherwise operational.
> >> >
> >> >root at gluster01:~# gluster volume status
> >> >Status of volume: disp1
> >> >Gluster process                              TCP Port  RDMA Port  Online  Pid
> >> >------------------------------------------------------------------------------
> >> >Brick gluster01:/exports/sda/brick1/disp1    49152     0          Y       3931
> >> >Brick gluster02:/exports/sda/brick1/disp1    49152     0          Y       2755
> >> >Brick gluster03:/exports/sda/brick1/disp1    49152     0          Y       2787
> >> >Brick gluster04:/exports/sda/brick1/disp1    49152     0          Y       2780
> >> >Brick gluster05:/exports/sda/brick1/disp1    49152     0          Y       2764
> >> >Brick gluster06:/exports/sda/brick1/disp1    49152     0          Y       2760
> >> >Brick gluster07:/exports/sda/brick1/disp1    49152     0          Y       2740
> >> >Brick gluster08:/exports/sda/brick1/disp1    49152     0          Y       2729
> >> >Brick gluster09:/exports/sda/brick1/disp1    49152     0          Y       2772
> >> >Brick gluster10:/exports/sda/brick1/disp1    49152     0          Y       2791
> >> >Brick gluster11:/exports/sda/brick1/disp1    49152     0          Y       2026
> >> >Brick gluster12:/exports/sda/brick1/disp1    N/A       N/A        N       N/A
> >> >Self-heal Daemon on localhost                N/A       N/A        Y       3952
> >> >Self-heal Daemon on gluster03                N/A       N/A        Y       2808
> >> >Self-heal Daemon on gluster02                N/A       N/A        Y       2776
> >> >Self-heal Daemon on gluster06                N/A       N/A        Y       2781
> >> >Self-heal Daemon on gluster07                N/A       N/A        Y       2761
> >> >Self-heal Daemon on gluster05                N/A       N/A        Y       2785
> >> >Self-heal Daemon on gluster08                N/A       N/A        Y       2750
> >> >Self-heal Daemon on gluster04                N/A       N/A        Y       2801
> >> >Self-heal Daemon on gluster09                N/A       N/A        Y       2793
> >> >Self-heal Daemon on gluster11                N/A       N/A        Y       2047
> >> >Self-heal Daemon on gluster10                N/A       N/A        Y       2812
> >> >Self-heal Daemon on gluster12                N/A       N/A        Y       542
> >> >
> >> >Task Status of Volume disp1
> >> >------------------------------------------------------------------------------
> >> >There are no active volume tasks
> >> >
> >> >On Sun, Mar 1, 2020 at 2:01 AM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> >> >
> >> >> On March 1, 2020 6:08:31 AM GMT+02:00, Fox <foxxz.net at gmail.com> wrote:
> >> >> >I am using a dozen Odroid HC2 ARM systems, each with a single HD/brick.
> >> >> >Running Ubuntu 18 and GlusterFS 7.2 installed from the Gluster PPA.
> >> >> >
> >> >> >I can create a dispersed volume and use it. If one of the cluster
> >> >> >members ducks out, say gluster12 reboots, when it comes back online it
> >> >> >shows connected in the peer list, but using
> >> >> >gluster volume heal <volname> info summary
> >> >> >
> >> >> >It shows up as
> >> >> >Brick gluster12:/exports/sda/brick1/disp1
> >> >> >Status: Transport endpoint is not connected
> >> >> >Total Number of entries: -
> >> >> >Number of entries in heal pending: -
> >> >> >Number of entries in split-brain: -
> >> >> >Number of entries possibly healing: -
> >> >> >
> >> >> >Trying to force a full heal doesn't fix it. The cluster member
> >> >> >otherwise works and heals for other non-disperse volumes even while
> >> >> >showing up as disconnected for the dispersed volume.
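> >> >> >
> >> >> >The full heal attempt was along these lines (same disp1 volume as above;
> >> >> >this is from memory rather than the exact terminal log):
> >> >> >
> >> >> >gluster volume heal disp1 full
> >> >> >gluster volume heal disp1 info summary
> >> >> ># the gluster12 brick still reports "Transport endpoint is not connected"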
> >> >> >
> >> >> >I have attached a terminal log of the volume creation and diagnostic
> >> >> >output. Could this be an ARM-specific problem?
> >> >> >
> >> >> >I tested a similar setup on x86 virtual machines. They were able to
> >> >> >heal a dispersed volume no problem. One thing I see in the ARM logs
> >> >> >I don't see in the x86 logs is lots of this:
> >> >> >[2020-03-01 03:54:45.856769] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 0d3c4cf3-e09c-4b9a-87d3-cdfc4f49b692
> >> >> >[2020-03-01 03:54:45.910203] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 0d806805-81e4-47ee-a331-1808b34949bf
> >> >> >[2020-03-01 03:54:45.932734] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-disp1-client-11: changing port to 49152 (from 0)
> >> >> >[2020-03-01 03:54:45.956803] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid d5768bad-7409-40f4-af98-4aef391d7ae4
> >> >> >[2020-03-01 03:54:46.000102] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 216f5583-e1b4-49cf-bef9-8cd34617beaf
> >> >> >[2020-03-01 03:54:46.044184] W [MSGID: 122035] [ec-common.c:668:ec_child_select] 0-disp1-disperse-0: Executing operation with some subvolumes unavailable. (800). FOP : 'LOOKUP' failed on '(null)' with gfid 1b610b49-2d69-4ee6-a440-5d3edd6693d1
> >> >>
> >> >> Hi,
> >> >>
> >> >> Are you sure that the gluster bricks on this node are up and running?
> >> >> What is the output of 'gluster volume status' on this system?
> >> >>
> >> >> Best Regards,
> >> >> Strahil Nikolov
> >> >>
> >>
> >> This seems like the brick is down.
> >> Check with 'ps aux | grep glusterfsd | grep disp1' on 'gluster12'.
> >> Most probably it is down and you need to verify the brick is properly
> >> mounted.
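> >> Something like this on gluster12 should show it (the /exports/sda mount
> >> point is taken from your volume status output; adjust if yours differs):
> >>
> >> ps aux | grep glusterfsd | grep disp1
> >> grep /exports/sda /proc/mounts
> >> gluster volume start disp1 force   # restarts any brick process that is not running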
> >>
> >> Best Regards,
> >> Strahil Nikolov
> >>
>
> Hi Fox,
>
>
> Submit a bug and provide a link in the mailing list (add the
> gluster-devel in CC once you register for that).
> Most probably it's a small thing that can be easily fixed.
>
> Have you tried to:
> gluster volume start <VOLNAME> force
>
> Best Regards,
> Strahil Nikolov
>