Micha Ober
2016-Dec-06 23:32 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Hi,

thank you for your answer and even more for the question! Until now, I was using FUSE. Today I changed all mounts to NFS using the same 3.7.17 version.

But: the problem is still the same. Now, the NFS logfile contains lines like these:

[2016-12-06 15:12:29.006325] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: server X.X.18.62:49153 has not responded in the last 42 seconds, disconnecting.

Interestingly enough, the IP address X.X.18.62 is the same machine! As I wrote earlier, each node serves both as a server and a client, as each node contributes bricks to the volume. Every server is connecting to itself via its hostname. For example, the fstab on the node "giant2" looks like:

#giant2:/gv0    /shared_data     glusterfs    defaults,noauto            0 0
#giant2:/gv2    /shared_slurm    glusterfs    defaults,noauto            0 0

giant2:/gv0     /shared_data     nfs          defaults,_netdev,vers=3    0 0
giant2:/gv2     /shared_slurm    nfs          defaults,_netdev,vers=3    0 0

So I understand the disconnects even less.

I don't know if it's possible to create a dummy cluster which exposes the same behaviour, because the disconnects only happen when there are compute jobs running on those nodes - and they are GPU compute jobs, so that's something which cannot easily be emulated in a VM.

As we have more clusters (which are running fine with an ancient 3.4 version :-)) and we are currently not dependent on this particular cluster (which may stay like this for this month, I think), I should be able to deploy the debug build on the "real" cluster, if you can provide one.

Regards and thanks,
Micha


On 06.12.2016 at 08:15, Mohammed Rafi K C wrote:
>
> On 12/03/2016 12:56 AM, Micha Ober wrote:
>> ** Update: ** I have downgraded from 3.8.6 to 3.7.17 now, but the problem still exists.
>>
>> Client log: http://paste.ubuntu.com/23569065/
>> Brick log: http://paste.ubuntu.com/23569067/
>>
>> Please note that each server has two bricks, whereas, according to the logs, one brick loses the connection to all other hosts:
>> [2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe)
>> [2016-12-02 18:38:53.703381] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.62:49118 failed (Broken pipe)
>> [2016-12-02 18:38:53.703380] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.107:49121 failed (Broken pipe)
>> [2016-12-02 18:38:53.703424] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.206:49120 failed (Broken pipe)
>> [2016-12-02 18:38:53.703359] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.58:49121 failed (Broken pipe)
>>
>> The SECOND brick on the SAME host is NOT affected, i.e. no disconnects! As I said, the network connection is fine and the disks are idle. The CPU always has 2 free cores.
>>
>> It looks like I have to downgrade to 3.4 now in order for the disconnects to stop.
>
> Hi Micha,
>
> Thanks for the update, and sorry for what happened with the higher gluster versions. I can understand the need for a downgrade as it is a production setup.
>
> Can you tell me which clients are used here? Whether it is FUSE, NFS, NFS-Ganesha, SMB or libgfapi?
>
> Since I'm not able to reproduce the issue (I have been trying for the last 3 days) and the logs are not much help here (we don't have much logging in the socket layer), could you please create a dummy cluster and try to reproduce the issue? Then we can play with that volume and I could provide some debug build which we can use for further debugging.
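As background for the log line at the top of this mail: the 42-second window is the GlusterFS client-side ping timeout (the network.ping-timeout volume option, which defaults to 42 seconds). Assuming the volume name gv0 from the log prefix 0-gv0-client-7 and a recent 3.7/3.8 CLI, the effective value can be checked and, if desired, raised temporarily while debugging; note that a larger timeout only postpones the disconnect, it does not address whatever is stalling the ping replies:

    # show the effective ping timeout for gv0 (default: 42 seconds)
    gluster volume get gv0 network.ping-timeout

    # raise it temporarily while debugging, e.g. to 120 seconds
    gluster volume set gv0 network.ping-timeout 120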
> If you don't have the bandwidth for this, please leave it ;).
>
> Regards
> Rafi KC
>
>> - Micha
>>
>> On 30.11.2016 at 06:57, Mohammed Rafi K C wrote:
>>>
>>> Hi Micha,
>>>
>>> I have changed the thread and subject so that your original thread remains the same for your query. Let's try to fix the problem you observed with 3.8.4, so I have started a new thread to discuss the frequent disconnect problem.
>>>
>>> *If anyone else has experienced the same problem, please respond to the mail.*
>>>
>>> It would be very helpful if you could give us some more logs from clients and bricks. Also, any steps to reproduce will surely help to chase the problem further.
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>> On 11/30/2016 04:44 AM, Micha Ober wrote:
>>>> I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain").
>>>>
>>>> The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
>>>>
>>>> Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
>>>>
>>>> I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems; the gigabit link between the nodes is not saturated at all. The disks are almost idle (<10%).
>>>>
>>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
>>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
>>>>
>>>> After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
>>>>
>>>> There also seems to be a bug report with a similar problem (but no progress): https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>>
>>>> For me, ALL servers are affected (not isolated to one or two servers).
>>>>
>>>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for more than 120 seconds." in the syslog.
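Since the number of files in split-brain is reported above to be growing, it may be worth enumerating exactly which entries are affected before deciding on a downgrade. A minimal sketch, assuming the scratch volume is gv0 as in the volume info that follows and a 3.7/3.8 CLI, run on any of the server nodes:

    # list all entries still pending heal on gv0
    gluster volume heal gv0 info

    # list only the entries gluster itself flags as split-brain
    gluster volume heal gv0 info split-brain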
>>>>
>>>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>>>
>>>> [root at giant2: ~]# gluster v info
>>>>
>>>> Volume Name: gv0
>>>> Type: Distributed-Replicate
>>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 6 x 2 = 12
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdc/gv0
>>>> Brick2: giant2:/gluster/sdc/gv0
>>>> Brick3: giant3:/gluster/sdc/gv0
>>>> Brick4: giant4:/gluster/sdc/gv0
>>>> Brick5: giant5:/gluster/sdc/gv0
>>>> Brick6: giant6:/gluster/sdc/gv0
>>>> Brick7: giant1:/gluster/sdd/gv0
>>>> Brick8: giant2:/gluster/sdd/gv0
>>>> Brick9: giant3:/gluster/sdd/gv0
>>>> Brick10: giant4:/gluster/sdd/gv0
>>>> Brick11: giant5:/gluster/sdd/gv0
>>>> Brick12: giant6:/gluster/sdd/gv0
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> nfs.disable: on
>>>>
>>>> Volume Name: gv2
>>>> Type: Replicate
>>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdd/gv2
>>>> Brick2: giant2:/gluster/sdd/gv2
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> cluster.granular-entry-heal: on
>>>> cluster.locking-scheme: granular
>>>> nfs.disable: on
>>>>
>>>>
>>>> 2016-11-30 0:10 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>>>
>>>> There also seems to be a bug report with a similar problem (but no progress): https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>>
>>>> For me, ALL servers are affected (not isolated to one or two servers).
>>>>
>>>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for more than 120 seconds." in the syslog.
>>>>
>>>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>>>
>>>> [root at giant2: ~]# gluster v info
>>>>
>>>> Volume Name: gv0
>>>> Type: Distributed-Replicate
>>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 6 x 2 = 12
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdc/gv0
>>>> Brick2: giant2:/gluster/sdc/gv0
>>>> Brick3: giant3:/gluster/sdc/gv0
>>>> Brick4: giant4:/gluster/sdc/gv0
>>>> Brick5: giant5:/gluster/sdc/gv0
>>>> Brick6: giant6:/gluster/sdc/gv0
>>>> Brick7: giant1:/gluster/sdd/gv0
>>>> Brick8: giant2:/gluster/sdd/gv0
>>>> Brick9: giant3:/gluster/sdd/gv0
>>>> Brick10: giant4:/gluster/sdd/gv0
>>>> Brick11: giant5:/gluster/sdd/gv0
>>>> Brick12: giant6:/gluster/sdd/gv0
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> nfs.disable: on
>>>>
>>>> Volume Name: gv2
>>>> Type: Replicate
>>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdd/gv2
>>>> Brick2: giant2:/gluster/sdd/gv2
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> cluster.granular-entry-heal: on
>>>> cluster.locking-scheme: granular
>>>> nfs.disable: on
>>>>
>>>>
>>>> 2016-11-29 19:21 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>>>
>>>> I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain").
>>>>
>>>> The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
>>>>
>>>> Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
>>>>
>>>> I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems; the gigabit link between the nodes is not saturated at all. The disks are almost idle (<10%).
>>>>
>>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
>>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
>>>>
>>>> After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
>>>>
>>>> 2016-11-29 18:53 GMT+01:00 Atin Mukherjee <amukherj at redhat.com>:
>>>>
>>>> Would you be able to share what is not working for you in 3.8.x (please mention the exact version)? 3.4 is quite old, and falling back to an unsupported version doesn't look like a feasible option.
>>>>
>>>> On Tue, 29 Nov 2016 at 17:01, Micha Ober <micha2k at gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I was using gluster 3.4 and upgraded to 3.8, but that version turned out to be unusable for me. I now need to downgrade.
>>>>
>>>> I'm running Ubuntu 14.04. As upgrades of the op-version are irreversible, I guess I have to delete all gluster volumes and re-create them with the downgraded version.
>>>>
>>>> 0. Backup data
>>>> 1. Unmount all gluster volumes
>>>> 2. apt-get purge glusterfs-server glusterfs-client
>>>> 3. Remove PPA for 3.8
>>>> 4. Add PPA for older version
>>>> 5. apt-get install glusterfs-server glusterfs-client
>>>> 6. Create volumes
>>>>
>>>> Is "purge" enough to delete all configuration files of the currently installed version, or do I need to manually clear some residues before installing an older version?
>>>>
>>>> Thanks.
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>
>>>> --
>>>> - Atin (atinm)
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
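On the "is purge enough" question quoted above: apt-get purge removes the packages' configuration under /etc, but as far as I know it leaves glusterd's state directory (/var/lib/glusterd, where glusterd.info also records the raised operating-version) and the brick directories' extended attributes behind, and both will get in the way of a freshly installed older version. A rough cleanup sketch, with brick paths taken from the volume info above; treat it as an untested assumption to adapt, run on every node, and only after the data has been backed up:

    # wipe glusterd state (peers, volfiles, op-version) that purge leaves behind
    rm -rf /var/lib/glusterd

    # let the old brick directories be reused by newly created volumes
    for b in /gluster/sdc/gv0 /gluster/sdd/gv0 /gluster/sdd/gv2; do
        setfattr -x trusted.glusterfs.volume-id "$b" 2>/dev/null
        setfattr -x trusted.gfid "$b" 2>/dev/null
        rm -rf "$b/.glusterfs"
    done

(setfattr comes from the attr package; alternatively, recreating the brick filesystems achieves the same result.)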
Mohammed Rafi K C
2016-Dec-07 17:08 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Hi Micha,

This is great. I will provide you with a debug build that has two fixes which I suspect as possible causes of the frequent disconnect issue, though I don't have much data to validate my theory yet. So I will take one more day to dig into that.

Thanks for your support, and opensource++

Regards
Rafi KC


On 12/07/2016 05:02 AM, Micha Ober wrote:
> Hi,
>
> thank you for your answer and even more for the question! Until now, I was using FUSE. Today I changed all mounts to NFS using the same 3.7.17 version.
>
> But: the problem is still the same. Now, the NFS logfile contains lines like these:
>
> [2016-12-06 15:12:29.006325] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: server X.X.18.62:49153 has not responded in the last 42 seconds, disconnecting.
>
> Interestingly enough, the IP address X.X.18.62 is the same machine! As I wrote earlier, each node serves both as a server and a client, as each node contributes bricks to the volume. Every server is connecting to itself via its hostname. For example, the fstab on the node "giant2" looks like:
>
> #giant2:/gv0    /shared_data     glusterfs    defaults,noauto            0 0
> #giant2:/gv2    /shared_slurm    glusterfs    defaults,noauto            0 0
>
> giant2:/gv0     /shared_data     nfs          defaults,_netdev,vers=3    0 0
> giant2:/gv2     /shared_slurm    nfs          defaults,_netdev,vers=3    0 0
>
> So I understand the disconnects even less.
>
> I don't know if it's possible to create a dummy cluster which exposes the same behaviour, because the disconnects only happen when there are compute jobs running on those nodes - and they are GPU compute jobs, so that's something which cannot easily be emulated in a VM.
>
> As we have more clusters (which are running fine with an ancient 3.4 version :-)) and we are currently not dependent on this particular cluster (which may stay like this for this month, I think), I should be able to deploy the debug build on the "real" cluster, if you can provide one.
>
> Regards and thanks,
> Micha
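If it helps with gathering data before the debug build is ready, the usual knobs for getting more detail out of a 3.7.x client and brick are the diagnostics log levels and a statedump; a sketch, assuming the affected volume is gv0 and that the verbosity is turned back down afterwards:

    # more verbose client- and brick-side logging for gv0 (revert to INFO when done)
    gluster volume set gv0 diagnostics.client-log-level DEBUG
    gluster volume set gv0 diagnostics.brick-log-level DEBUG

    # dump the internal state of the brick processes (typically written under /var/run/gluster)
    gluster volume statedump gv0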