Mohammed Rafi K C
2016-Dec-19 06:28 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Hi Micha, Sorry for the late reply. I was busy with some other things. If you have still the setup available Can you enable TRACE log level [1],[2] and see if you could find any log entries when the network start disconnecting. Basically I'm trying to find out any disconnection had occurred other than ping timer expire issue. [1] : gluster volume <volname> diagnostics.brick-log-level TRACE [2] : gluster volume <volname> diagnostics.client-log-level TRACE Regards Rafi KC On 12/08/2016 07:59 PM, Atin Mukherjee wrote:> > > On Thu, Dec 8, 2016 at 4:37 PM, Micha Ober <micha2k at gmail.com > <mailto:micha2k at gmail.com>> wrote: > > Hi Rafi, > > thank you for your support. It is greatly appreciated. > > Just some more thoughts from my side: > > There have been no reports from other users in *this* thread > until now, but I have found at least one user with a very simiar > problem in an older thread: > > https://www.gluster.org/pipermail/gluster-users/2014-November/019637.html > <https://www.gluster.org/pipermail/gluster-users/2014-November/019637.html> > > He is also reporting disconnects with no apparent reasons, > althogh his setup is a bit more complicated, also involving a > firewall. In our setup, all servers/clients are connected via 1 > GbE with no firewall or anything that might block/throttle > traffic. Also, we are using exactly the same software versions on > all nodes. > > > I can also find some reports in the bugtracker when searching for > "rpc_client_ping_timer_expired" and "rpc_clnt_ping_timer_expired" > (looks like spelling changed during versions). > > https://bugzilla.redhat.com/show_bug.cgi?id=1096729 > <https://bugzilla.redhat.com/show_bug.cgi?id=1096729> > > > Just FYI, this is a different issue, here GlusterD fails to handle the > volume of incoming requests on time since MT-epoll is not enabled here. > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1370683 > <https://bugzilla.redhat.com/show_bug.cgi?id=1370683> > > But both reports involve large traffic/load on the bricks/disks, > which is not the case for out setup. > To give a ballpark figure: Over three days, 30 GiB were written. > And the data was not written at once, but continuously over the > whole time. > > > Just to be sure, I have checked the logfiles of one of the other > clusters right now, which are sitting in the same building, in the > same rack, even on the same switch, running the same jobs, but > with glusterfs 3.4.2 and I can see no disconnects in the logfiles. > So I can definitely rule out our infrastructure as problem. > > Regards, > Micha > > > > Am 07.12.2016 um 18:08 schrieb Mohammed Rafi K C: >> >> Hi Micha, >> >> This is great. I will provide you one debug build which has two >> fixes which I possible suspect for a frequent disconnect issue, >> though I don't have much data to validate my theory. So I will >> take one more day to dig in to that. >> >> Thanks for your support, and opensource++ >> >> Regards >> >> Rafi KC >> >> On 12/07/2016 05:02 AM, Micha Ober wrote: >>> Hi, >>> >>> thank you for your answer and even more for the question! >>> Until now, I was using FUSE. Today I changed all mounts to NFS >>> using the same 3.7.17 version. >>> >>> But: The problem is still the same. Now, the NFS logfile >>> contains lines like these: >>> >>> [2016-12-06 15:12:29.006325] C >>> [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] >>> 0-gv0-client-7: server X.X.18.62:49153 has not responded in the >>> last 42 seconds, disconnecting. 
>>> >>> Interestingly enough, the IP address X.X.18.62 is the same >>> machine! As I wrote earlier, each node serves both as a server >>> and a client, as each node contributes bricks to the volume. >>> Every server is connecting to itself via its hostname. For >>> example, the fstab on the node "giant2" looks like: >>> >>> #giant2:/gv0 /shared_data glusterfs defaults,noauto >>> 0 0 >>> #giant2:/gv2 /shared_slurm glusterfs defaults,noauto >>> 0 0 >>> >>> giant2:/gv0 /shared_data nfs >>> defaults,_netdev,vers=3 0 0 >>> giant2:/gv2 /shared_slurm nfs >>> defaults,_netdev,vers=3 0 0 >>> >>> So I understand the disconnects even less. >>> >>> I don't know if it's possible to create a dummy cluster which >>> exposes the same behaviour, because the disconnects only happen >>> when there are compute jobs running on those nodes - and they >>> are GPU compute jobs, so that's something which cannot be easily >>> emulated in a VM. >>> >>> As we have more clusters (which are running fine with an ancient >>> 3.4 version :-)) and we are currently not dependent on this >>> particular cluster (which may stay like this for this month, I >>> think) I should be able to deploy the debug build on the "real" >>> cluster, if you can provide a debug build. >>> >>> Regards and thanks, >>> Micha >>> >>> >>> >>> Am 06.12.2016 um 08:15 schrieb Mohammed Rafi K C: >>>> >>>> >>>> >>>> On 12/03/2016 12:56 AM, Micha Ober wrote: >>>>> ** Update: ** I have downgraded from 3.8.6 to 3.7.17 now, but >>>>> the problem still exists. >>>>> >>>>> Client log: http://paste.ubuntu.com/23569065/ >>>>> <http://paste.ubuntu.com/23569065/> >>>>> Brick log: http://paste.ubuntu.com/23569067/ >>>>> <http://paste.ubuntu.com/23569067/> >>>>> >>>>> Please note that each server has two bricks. >>>>> Whereas, according to the logs, one brick loses the connection >>>>> to all other hosts: >>>>> [2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe) >>>>> [2016-12-02 18:38:53.703381] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.62:49118 failed (Broken pipe) >>>>> [2016-12-02 18:38:53.703380] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.107:49121 failed (Broken pipe) >>>>> [2016-12-02 18:38:53.703424] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.206:49120 failed (Broken pipe) >>>>> [2016-12-02 18:38:53.703359] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.58:49121 failed (Broken pipe) >>>>> >>>>> The SECOND brick on the SAME host is NOT affected, i.e. no disconnects! >>>>> As I said, the network connection is fine and the disks are idle. >>>>> The CPU always has 2 free cores. >>>>> >>>>> It looks like I have to downgrade to 3.4 now in order for the disconnects to stop. >>>> >>>> Hi Micha, >>>> >>>> Thanks for the update and sorry for what happened with gluster >>>> higher versions. I can understand the need for downgrade as it >>>> is a production setup. >>>> >>>> Can you tell me the clients used here ? whether it is a >>>> fuse,nfs,nfs-ganesha, smb or libgfapi ? >>>> >>>> Since I'm not able to reproduce the issue (I have been trying >>>> from last 3days) and the logs are not much helpful here (we >>>> don't have much logs in socket layer), Could you please create >>>> a dummy cluster and try to reproduce the issue? If then we can >>>> play with that volume and I could provide some debug build >>>> which we can use for further debugging? >>>> >>>> If you don't have bandwidth for this, please leave it ;). 
>>>> >>>> Regards >>>> Rafi KC >>>> >>>>> - Micha >>>>> >>>>> Am 30.11.2016 um 06:57 schrieb Mohammed Rafi K C: >>>>>> >>>>>> Hi Micha, >>>>>> >>>>>> I have changed the thread and subject so that your original >>>>>> thread remain same for your query. Let's try to fix the >>>>>> problem what you observed with 3.8.4, So I have started a new >>>>>> thread to discuss the frequent disconnect problem. >>>>>> >>>>>> *If any one else has experienced the same problem, please >>>>>> respond to the mail.* >>>>>> >>>>>> It would be very helpful if you could give us some more logs >>>>>> from clients and bricks. Also any reproducible steps will >>>>>> surely help to chase the problem further. >>>>>> >>>>>> Regards >>>>>> >>>>>> Rafi KC >>>>>> >>>>>> On 11/30/2016 04:44 AM, Micha Ober wrote: >>>>>>> I had opened another thread on this mailing list (Subject: >>>>>>> "After upgrade from 3.4.2 to 3.8.5 - High CPU usage >>>>>>> resulting in disconnects and split-brain"). >>>>>>> >>>>>>> The title may be a bit misleading now, as I am no longer >>>>>>> observing high CPU usage after upgrading to 3.8.6, but the >>>>>>> disconnects are still happening and the number of files in >>>>>>> split-brain is growing. >>>>>>> >>>>>>> Setup: 6 compute nodes, each serving as a glusterfs server >>>>>>> and client, Ubuntu 14.04, two bricks per node, >>>>>>> distribute-replicate >>>>>>> >>>>>>> I have two gluster volumes set up (one for scratch data, one >>>>>>> for the slurm scheduler). Only the scratch data volume shows >>>>>>> critical errors "[...] has not responded in the last 42 >>>>>>> seconds, disconnecting.". So I can rule out network >>>>>>> problems, the gigabit link between the nodes is not >>>>>>> saturated at all. The disks are almost idle (<10%). >>>>>>> >>>>>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on a another compute >>>>>>> cluster, running fine since it was deployed. >>>>>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, >>>>>>> running fine for almost a year. >>>>>>> >>>>>>> After upgrading to 3.8.5, the problems (as described) >>>>>>> started. I would like to use some of the new features of the >>>>>>> newer versions (like bitrot), but the users can't run their >>>>>>> compute jobs right now because the result files are garbled. >>>>>>> >>>>>>> There also seems to be a bug report with a smiliar problem: >>>>>>> (but no progress) >>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683 >>>>>>> <https://bugzilla.redhat.com/show_bug.cgi?id=1370683> >>>>>>> >>>>>>> For me, ALL servers are affected (not isolated to one or two >>>>>>> servers) >>>>>>> >>>>>>> I also see messages like "INFO: task gpu_graphene_bv:4476 >>>>>>> blocked for more than 120 seconds." in the syslog. 
>>>>>>> >>>>>>> For completeness (gv0 is the scratch volume, gv2 the slurm >>>>>>> volume): >>>>>>> >>>>>>> [root at giant2: ~]# gluster v info >>>>>>> >>>>>>> Volume Name: gv0 >>>>>>> Type: Distributed-Replicate >>>>>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86 >>>>>>> Status: Started >>>>>>> Snapshot Count: 0 >>>>>>> Number of Bricks: 6 x 2 = 12 >>>>>>> Transport-type: tcp >>>>>>> Bricks: >>>>>>> Brick1: giant1:/gluster/sdc/gv0 >>>>>>> Brick2: giant2:/gluster/sdc/gv0 >>>>>>> Brick3: giant3:/gluster/sdc/gv0 >>>>>>> Brick4: giant4:/gluster/sdc/gv0 >>>>>>> Brick5: giant5:/gluster/sdc/gv0 >>>>>>> Brick6: giant6:/gluster/sdc/gv0 >>>>>>> Brick7: giant1:/gluster/sdd/gv0 >>>>>>> Brick8: giant2:/gluster/sdd/gv0 >>>>>>> Brick9: giant3:/gluster/sdd/gv0 >>>>>>> Brick10: giant4:/gluster/sdd/gv0 >>>>>>> Brick11: giant5:/gluster/sdd/gv0 >>>>>>> Brick12: giant6:/gluster/sdd/gv0 >>>>>>> Options Reconfigured: >>>>>>> auth.allow: X.X.X.*,127.0.0.1 >>>>>>> nfs.disable: on >>>>>>> >>>>>>> Volume Name: gv2 >>>>>>> Type: Replicate >>>>>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d >>>>>>> Status: Started >>>>>>> Snapshot Count: 0 >>>>>>> Number of Bricks: 1 x 2 = 2 >>>>>>> Transport-type: tcp >>>>>>> Bricks: >>>>>>> Brick1: giant1:/gluster/sdd/gv2 >>>>>>> Brick2: giant2:/gluster/sdd/gv2 >>>>>>> Options Reconfigured: >>>>>>> auth.allow: X.X.X.*,127.0.0.1 >>>>>>> cluster.granular-entry-heal: on >>>>>>> cluster.locking-scheme: granular >>>>>>> nfs.disable: on >>>>>>> >>>>>>> >>>>>>> 2016-11-30 0:10 GMT+01:00 Micha Ober <micha2k at gmail.com >>>>>>> <mailto:micha2k at gmail.com>>: >>>>>>> >>>>>>> There also seems to be a bug report with a smiliar >>>>>>> problem: (but no progress) >>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683 >>>>>>> <https://bugzilla.redhat.com/show_bug.cgi?id=1370683> >>>>>>> >>>>>>> For me, ALL servers are affected (not isolated to one or >>>>>>> two servers) >>>>>>> >>>>>>> I also see messages like "INFO: task >>>>>>> gpu_graphene_bv:4476 blocked for more than 120 seconds." >>>>>>> in the syslog. 
>>>>>>> >>>>>>> For completeness (gv0 is the scratch volume, gv2 the >>>>>>> slurm volume): >>>>>>> >>>>>>> [root at giant2: ~]# gluster v info >>>>>>> >>>>>>> Volume Name: gv0 >>>>>>> Type: Distributed-Replicate >>>>>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86 >>>>>>> Status: Started >>>>>>> Snapshot Count: 0 >>>>>>> Number of Bricks: 6 x 2 = 12 >>>>>>> Transport-type: tcp >>>>>>> Bricks: >>>>>>> Brick1: giant1:/gluster/sdc/gv0 >>>>>>> Brick2: giant2:/gluster/sdc/gv0 >>>>>>> Brick3: giant3:/gluster/sdc/gv0 >>>>>>> Brick4: giant4:/gluster/sdc/gv0 >>>>>>> Brick5: giant5:/gluster/sdc/gv0 >>>>>>> Brick6: giant6:/gluster/sdc/gv0 >>>>>>> Brick7: giant1:/gluster/sdd/gv0 >>>>>>> Brick8: giant2:/gluster/sdd/gv0 >>>>>>> Brick9: giant3:/gluster/sdd/gv0 >>>>>>> Brick10: giant4:/gluster/sdd/gv0 >>>>>>> Brick11: giant5:/gluster/sdd/gv0 >>>>>>> Brick12: giant6:/gluster/sdd/gv0 >>>>>>> Options Reconfigured: >>>>>>> auth.allow: X.X.X.*,127.0.0.1 >>>>>>> nfs.disable: on >>>>>>> >>>>>>> Volume Name: gv2 >>>>>>> Type: Replicate >>>>>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d >>>>>>> Status: Started >>>>>>> Snapshot Count: 0 >>>>>>> Number of Bricks: 1 x 2 = 2 >>>>>>> Transport-type: tcp >>>>>>> Bricks: >>>>>>> Brick1: giant1:/gluster/sdd/gv2 >>>>>>> Brick2: giant2:/gluster/sdd/gv2 >>>>>>> Options Reconfigured: >>>>>>> auth.allow: X.X.X.*,127.0.0.1 >>>>>>> cluster.granular-entry-heal: on >>>>>>> cluster.locking-scheme: granular >>>>>>> nfs.disable: on >>>>>>> >>>>>>> >>>>>>> 2016-11-29 19:21 GMT+01:00 Micha Ober <micha2k at gmail.com >>>>>>> <mailto:micha2k at gmail.com>>: >>>>>>> >>>>>>> I had opened another thread on this mailing list >>>>>>> (Subject: "After upgrade from 3.4.2 to 3.8.5 - High >>>>>>> CPU usage resulting in disconnects and split-brain"). >>>>>>> >>>>>>> The title may be a bit misleading now, as I am no >>>>>>> longer observing high CPU usage after upgrading to >>>>>>> 3.8.6, but the disconnects are still happening and >>>>>>> the number of files in split-brain is growing. >>>>>>> >>>>>>> Setup: 6 compute nodes, each serving as a glusterfs >>>>>>> server and client, Ubuntu 14.04, two bricks per >>>>>>> node, distribute-replicate >>>>>>> >>>>>>> I have two gluster volumes set up (one for scratch >>>>>>> data, one for the slurm scheduler). Only the scratch >>>>>>> data volume shows critical errors "[...] has not >>>>>>> responded in the last 42 seconds, disconnecting.". >>>>>>> So I can rule out network problems, the gigabit link >>>>>>> between the nodes is not saturated at all. The disks >>>>>>> are almost idle (<10%). >>>>>>> >>>>>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on a another >>>>>>> compute cluster, running fine since it was deployed. >>>>>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this >>>>>>> cluster, running fine for almost a year. >>>>>>> >>>>>>> After upgrading to 3.8.5, the problems (as >>>>>>> described) started. I would like to use some of the >>>>>>> new features of the newer versions (like bitrot), >>>>>>> but the users can't run their compute jobs right now >>>>>>> because the result files are garbled. >>>>>>> >>>>>>> 2016-11-29 18:53 GMT+01:00 Atin Mukherjee >>>>>>> <amukherj at redhat.com <mailto:amukherj at redhat.com>>: >>>>>>> >>>>>>> Would you be able to share what is not working >>>>>>> for you in 3.8.x (mention the exact version). >>>>>>> 3.4 is quite old and falling back to an >>>>>>> unsupported version doesn't look a feasible option. 
>>>>>>>
>>>>>>> On Tue, 29 Nov 2016 at 17:01, Micha Ober <micha2k at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was using gluster 3.4 and upgraded to 3.8, but that version
>>>>>>> showed to be unusable for me. I now need to downgrade.
>>>>>>>
>>>>>>> I'm running Ubuntu 14.04. As upgrades of the op version are
>>>>>>> irreversible, I guess I have to delete all gluster volumes and
>>>>>>> re-create them with the downgraded version.
>>>>>>>
>>>>>>> 0. Backup data
>>>>>>> 1. Unmount all gluster volumes
>>>>>>> 2. apt-get purge glusterfs-server glusterfs-client
>>>>>>> 3. Remove PPA for 3.8
>>>>>>> 4. Add PPA for older version
>>>>>>> 5. apt-get install glusterfs-server glusterfs-client
>>>>>>> 6. Create volumes
>>>>>>>
>>>>>>> Is "purge" enough to delete all configuration files of the currently
>>>>>>> installed version or do I need to manually clear some residues
>>>>>>> before installing an older version?
>>>>>>>
>>>>>>> Thanks.
>>>>>>> _______________________________________________
>>>>>>> Gluster-users mailing list
>>>>>>> Gluster-users at gluster.org
>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>>>
>>>>>>> --
>>>>>>> - Atin (atinm)
>
> --
> ~ Atin (atinm)
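The TRACE log level suggested in [1] and [2] above is applied with the standard "gluster volume set" syntax; a minimal sketch for the gv0 volume from this thread (TRACE is very verbose, so it is worth reverting once the disconnect has been captured):

# enable TRACE logging for the bricks and clients of gv0
gluster volume set gv0 diagnostics.brick-log-level TRACE
gluster volume set gv0 diagnostics.client-log-level TRACE

# ... wait for the next disconnect, then collect the brick and client/NFS logs ...

# revert both options to their defaults afterwards
gluster volume reset gv0 diagnostics.brick-log-level
gluster volume reset gv0 diagnostics.client-log-level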
Mohammed Rafi K C
2016-Dec-19 15:09 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Hi Micha,

Can you please also check whether there are any error messages in dmesg? Basically I'm trying to see whether you're hitting the issue described in https://bugzilla.kernel.org/show_bug.cgi?id=73831 .

Regards
Rafi KC

On 12/19/2016 11:58 AM, Mohammed Rafi K C wrote:
> [...]
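As a rough sketch of the dmesg check Rafi asks for above (the "blocked for more than 120 seconds" warnings Micha quotes from syslog are produced by the kernel's hung-task detector; the log paths are the Ubuntu 14.04 defaults):

# kernel ring buffer
dmesg | grep -iE "blocked for more than|hung_task|call trace"

# the same warnings usually also land in the persistent logs
grep -iE "blocked for more than|hung_task" /var/log/kern.log /var/log/syslog

# timeout of the hung-task detector (120 seconds by default)
cat /proc/sys/kernel/hung_task_timeout_secs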
Micha Ober
2016-Dec-20 12:31 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Hi Rafi,

here are the log files:

NFS: http://paste.ubuntu.com/23658653/
Brick: http://paste.ubuntu.com/23658656/

The brick log is from the brick which caused the last disconnect at 2016-12-20 06:46:36 (0-gv0-client-7).

For completeness, here is also the dmesg output: http://paste.ubuntu.com/23658691/

Regards,
Micha

2016-12-19 7:28 GMT+01:00 Mohammed Rafi K C <rkavunga at redhat.com>:
> [...]
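The "has not responded in the last 42 seconds, disconnecting" entries in the logs above come from the client-side ping timer; 42 seconds is the default of the network.ping-timeout volume option. A short sketch of how that knob can be inspected and changed on gv0 (note that raising it only changes how long a client waits before declaring a brick dead, it does not explain why the brick stops answering):

# reconfigured options are listed by volume info; the 42 second default applies unless overridden
gluster volume info gv0

# raise the ping timeout, e.g. to 60 seconds
gluster volume set gv0 network.ping-timeout 60

# go back to the default
gluster volume reset gv0 network.ping-timeout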
Micha Ober
2017-Mar-08 18:35 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
?Just to let you know: I have reverted back to glusterfs 3.4.2 and everything is working again. No more disconnects, no more errors in the kernel log. So there *has* to be some kind of regression in the newer versions?. Sadly, I guess, it will be hard to find. 2016-12-20 13:31 GMT+01:00 Micha Ober <micha2k at gmail.com>:> Hi Rafi, > > here are the log files: > > NFS: http://paste.ubuntu.com/23658653/ > Brick: http://paste.ubuntu.com/23658656/ > > The brick log is of the brick which has caused the last disconnect > at 2016-12-20 06:46:36 (0-gv0-client-7). > > For completeness, here is also dmesg output: http://paste.ubuntu. > com/23658691/ > > Regards, > Micha > > 2016-12-19 7:28 GMT+01:00 Mohammed Rafi K C <rkavunga at redhat.com>: > >> Hi Micha, >> >> Sorry for the late reply. I was busy with some other things. >> >> If you have still the setup available Can you enable TRACE log level >> [1],[2] and see if you could find any log entries when the network start >> disconnecting. Basically I'm trying to find out any disconnection had >> occurred other than ping timer expire issue. >> >> >> >> [1] : gluster volume <volname> diagnostics.brick-log-level TRACE >> >> [2] : gluster volume <volname> diagnostics.client-log-level TRACE >> >> >> Regards >> >> Rafi KC >> >> On 12/08/2016 07:59 PM, Atin Mukherjee wrote: >> >> >> >> On Thu, Dec 8, 2016 at 4:37 PM, Micha Ober <micha2k at gmail.com> wrote: >> >>> Hi Rafi, >>> >>> thank you for your support. It is greatly appreciated. >>> >>> Just some more thoughts from my side: >>> >>> There have been no reports from other users in *this* thread until now, >>> but I have found at least one user with a very simiar problem in an older >>> thread: >>> >>> https://www.gluster.org/pipermail/gluster-users/2014-Novembe >>> r/019637.html >>> >>> He is also reporting disconnects with no apparent reasons, althogh his >>> setup is a bit more complicated, also involving a firewall. In our setup, >>> all servers/clients are connected via 1 GbE with no firewall or anything >>> that might block/throttle traffic. Also, we are using exactly the same >>> software versions on all nodes. >>> >>> >>> I can also find some reports in the bugtracker when searching for >>> "rpc_client_ping_timer_expired" and "rpc_clnt_ping_timer_expired" >>> (looks like spelling changed during versions). >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=1096729 >>> >> >> Just FYI, this is a different issue, here GlusterD fails to handle the >> volume of incoming requests on time since MT-epoll is not enabled here. >> >> >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683 >>> >>> But both reports involve large traffic/load on the bricks/disks, which >>> is not the case for out setup. >>> To give a ballpark figure: Over three days, 30 GiB were written. And the >>> data was not written at once, but continuously over the whole time. >>> >>> >>> Just to be sure, I have checked the logfiles of one of the other >>> clusters right now, which are sitting in the same building, in the same >>> rack, even on the same switch, running the same jobs, but with glusterfs >>> 3.4.2 and I can see no disconnects in the logfiles. So I can definitely >>> rule out our infrastructure as problem. >>> >>> Regards, >>> Micha >>> >>> >>> >>> Am 07.12.2016 um 18:08 schrieb Mohammed Rafi K C: >>> >>> Hi Micha, >>> >>> This is great. I will provide you one debug build which has two fixes >>> which I possible suspect for a frequent disconnect issue, though I don't >>> have much data to validate my theory. 
So I will take one more day to dig in >>> to that. >>> >>> Thanks for your support, and opensource++ >>> >>> Regards >>> >>> Rafi KC >>> On 12/07/2016 05:02 AM, Micha Ober wrote: >>> >>> Hi, >>> >>> thank you for your answer and even more for the question! >>> Until now, I was using FUSE. Today I changed all mounts to NFS using the >>> same 3.7.17 version. >>> >>> But: The problem is still the same. Now, the NFS logfile contains lines >>> like these: >>> >>> [2016-12-06 15:12:29.006325] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] >>> 0-gv0-client-7: server X.X.18.62:49153 has not responded in the last 42 >>> seconds, disconnecting. >>> >>> Interestingly enough, the IP address X.X.18.62 is the same machine! As >>> I wrote earlier, each node serves both as a server and a client, as each >>> node contributes bricks to the volume. Every server is connecting to itself >>> via its hostname. For example, the fstab on the node "giant2" looks like: >>> >>> #giant2:/gv0 /shared_data glusterfs defaults,noauto 0 0 >>> #giant2:/gv2 /shared_slurm glusterfs defaults,noauto 0 0 >>> >>> giant2:/gv0 /shared_data nfs defaults,_netdev,vers=3 >>> 0 0 >>> giant2:/gv2 /shared_slurm nfs defaults,_netdev,vers=3 >>> 0 0 >>> >>> So I understand the disconnects even less. >>> >>> I don't know if it's possible to create a dummy cluster which exposes >>> the same behaviour, because the disconnects only happen when there are >>> compute jobs running on those nodes - and they are GPU compute jobs, so >>> that's something which cannot be easily emulated in a VM. >>> >>> As we have more clusters (which are running fine with an ancient 3.4 >>> version :-)) and we are currently not dependent on this particular cluster >>> (which may stay like this for this month, I think) I should be able to >>> deploy the debug build on the "real" cluster, if you can provide a debug >>> build. >>> >>> Regards and thanks, >>> Micha >>> >>> >>> >>> Am 06.12.2016 um 08:15 schrieb Mohammed Rafi K C: >>> >>> >>> >>> On 12/03/2016 12:56 AM, Micha Ober wrote: >>> >>> ** Update: ** I have downgraded from 3.8.6 to 3.7.17 now, but the >>> problem still exists. >>> >>> >>> Client log: <http://paste.ubuntu.com/23569065/>http://paste.ubuntu.com/ >>> 23569065/ >>> Brick log: <http://paste.ubuntu.com/23569067/>http://paste.ubuntu.com/ >>> 23569067/ >>> >>> Please note that each server has two bricks. >>> Whereas, according to the logs, one brick loses the connection to all >>> other hosts: >>> >>> [2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe) >>> [2016-12-02 18:38:53.703381] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.62:49118 failed (Broken pipe) >>> [2016-12-02 18:38:53.703380] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.107:49121 failed (Broken pipe) >>> [2016-12-02 18:38:53.703424] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.206:49120 failed (Broken pipe) >>> [2016-12-02 18:38:53.703359] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.58:49121 failed (Broken pipe) >>> >>> The SECOND brick on the SAME host is NOT affected, i.e. no disconnects! >>> As I said, the network connection is fine and the disks are idle. >>> The CPU always has 2 free cores. >>> >>> It looks like I have to downgrade to 3.4 now in order for the disconnects to stop. >>> >>> >>> Hi Micha, >>> >>> Thanks for the update and sorry for what happened with gluster higher >>> versions. 
>>> I can understand the need for a downgrade as it is a production setup.
>>>
>>> Can you tell me which clients are used here: fuse, nfs, nfs-ganesha, smb or libgfapi?
>>>
>>> Since I'm not able to reproduce the issue (I have been trying for the last 3 days) and the logs are not much help here (we don't have many logs in the socket layer), could you please create a dummy cluster and try to reproduce the issue? Then we can play with that volume and I could provide a debug build which we can use for further debugging.
>>>
>>> If you don't have the bandwidth for this, please leave it ;).
>>>
>>> Regards
>>> Rafi KC
>>>
>>> - Micha
>>>
>>> Am 30.11.2016 um 06:57 schrieb Mohammed Rafi K C:
>>>
>>> Hi Micha,
>>>
>>> I have changed the thread and subject so that your original thread remains the same for your query. Let's try to fix the problem you observed with 3.8.4, so I have started a new thread to discuss the frequent disconnect problem.
>>>
>>> *If anyone else has experienced the same problem, please respond to the mail.*
>>>
>>> It would be very helpful if you could give us some more logs from clients and bricks. Also, any steps to reproduce will surely help to chase the problem further.
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>> On 11/30/2016 04:44 AM, Micha Ober wrote:
>>>
>>> I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain").
>>>
>>> The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
>>>
>>> Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
>>>
>>> I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows the critical errors "[...] has not responded in the last 42 seconds, disconnecting." So I can rule out network problems; the gigabit link between the nodes is not saturated at all. The disks are almost idle (<10%).
>>>
>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
>>>
>>> After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
>>>
>>> There also seems to be a bug report with a similar problem (but no progress):
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>
>>> For me, ALL servers are affected (not isolated to one or two servers).
>>>
>>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for more than 120 seconds." in the syslog.
>>>
>>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>>
>>> [root at giant2: ~]# gluster v info
>>>
>>> Volume Name: gv0
>>> Type: Distributed-Replicate
>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 6 x 2 = 12
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: giant1:/gluster/sdc/gv0
>>> Brick2: giant2:/gluster/sdc/gv0
>>> Brick3: giant3:/gluster/sdc/gv0
>>> Brick4: giant4:/gluster/sdc/gv0
>>> Brick5: giant5:/gluster/sdc/gv0
>>> Brick6: giant6:/gluster/sdc/gv0
>>> Brick7: giant1:/gluster/sdd/gv0
>>> Brick8: giant2:/gluster/sdd/gv0
>>> Brick9: giant3:/gluster/sdd/gv0
>>> Brick10: giant4:/gluster/sdd/gv0
>>> Brick11: giant5:/gluster/sdd/gv0
>>> Brick12: giant6:/gluster/sdd/gv0
>>> Options Reconfigured:
>>> auth.allow: X.X.X.*,127.0.0.1
>>> nfs.disable: on
>>>
>>> Volume Name: gv2
>>> Type: Replicate
>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: giant1:/gluster/sdd/gv2
>>> Brick2: giant2:/gluster/sdd/gv2
>>> Options Reconfigured:
>>> auth.allow: X.X.X.*,127.0.0.1
>>> cluster.granular-entry-heal: on
>>> cluster.locking-scheme: granular
>>> nfs.disable: on
>>>
>>> 2016-11-30 0:10 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>>
>>>> There also seems to be a bug report with a similar problem (but no progress):
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>>
>>>> For me, ALL servers are affected (not isolated to one or two servers).
>>>>
>>>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for more than 120 seconds." in the syslog.
>>>>
>>>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>>>
>>>> [root at giant2: ~]# gluster v info
>>>>
>>>> Volume Name: gv0
>>>> Type: Distributed-Replicate
>>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 6 x 2 = 12
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdc/gv0
>>>> Brick2: giant2:/gluster/sdc/gv0
>>>> Brick3: giant3:/gluster/sdc/gv0
>>>> Brick4: giant4:/gluster/sdc/gv0
>>>> Brick5: giant5:/gluster/sdc/gv0
>>>> Brick6: giant6:/gluster/sdc/gv0
>>>> Brick7: giant1:/gluster/sdd/gv0
>>>> Brick8: giant2:/gluster/sdd/gv0
>>>> Brick9: giant3:/gluster/sdd/gv0
>>>> Brick10: giant4:/gluster/sdd/gv0
>>>> Brick11: giant5:/gluster/sdd/gv0
>>>> Brick12: giant6:/gluster/sdd/gv0
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> nfs.disable: on
>>>>
>>>> Volume Name: gv2
>>>> Type: Replicate
>>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdd/gv2
>>>> Brick2: giant2:/gluster/sdd/gv2
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> cluster.granular-entry-heal: on
>>>> cluster.locking-scheme: granular
>>>> nfs.disable: on
>>>>
>>>> 2016-11-29 19:21 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>>>
>>>>> I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain").
>>>>>
>>>>> The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
>>>>>
>>>>> Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
>>>>>
>>>>> I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows the critical errors "[...] has not responded in the last 42 seconds, disconnecting." So I can rule out network problems; the gigabit link between the nodes is not saturated at all. The disks are almost idle (<10%).
>>>>>
>>>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
>>>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
>>>>>
>>>>> After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
>>>>>
>>>>> 2016-11-29 18:53 GMT+01:00 Atin Mukherjee <amukherj at redhat.com>:
>>>>>
>>>>>> Would you be able to share what is not working for you in 3.8.x (please mention the exact version)? 3.4 is quite old and falling back to an unsupported version doesn't look like a feasible option.
>>>>>>
>>>>>> On Tue, 29 Nov 2016 at 17:01, Micha Ober <micha2k at gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was using gluster 3.4 and upgraded to 3.8, but that version proved to be unusable for me. I now need to downgrade.
>>>>>>>
>>>>>>> I'm running Ubuntu 14.04. As upgrades of the op version are irreversible, I guess I have to delete all gluster volumes and re-create them with the downgraded version.
>>>>>>>
>>>>>>> 0. Back up data
>>>>>>> 1. Unmount all gluster volumes
>>>>>>> 2. apt-get purge glusterfs-server glusterfs-client
>>>>>>> 3. Remove the PPA for 3.8
>>>>>>> 4. Add the PPA for the older version
>>>>>>> 5. apt-get install glusterfs-server glusterfs-client
>>>>>>> 6. Create volumes
>>>>>>>
>>>>>>> Is "purge" enough to delete all configuration files of the currently installed version, or do I need to manually clear some residue before installing an older version?
>>>>>>>
>>>>>>> Thanks.
>>>>>>> _______________________________________________
>>>>>>> Gluster-users mailing list
>>>>>>> Gluster-users at gluster.org
>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>>
>>>>>> --
>>>>>> - Atin (atinm)
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>> --
>> ~ Atin (atinm)
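
For reference, the log-level change Rafi asks for above boils down to two volume options. A minimal sketch, assuming the scratch volume gv0 from this thread and the default log locations (adjust the volume name and paths for your install):

# Raise brick- and client-side logging to TRACE (very verbose) for the volume:
gluster volume set gv0 diagnostics.brick-log-level TRACE
gluster volume set gv0 diagnostics.client-log-level TRACE

# Reproduce the disconnects, then collect the brick logs (typically under
# /var/log/glusterfs/bricks/ on the servers) and the client or NFS logs
# (typically under /var/log/glusterfs/ on the clients).

# Reset both options to their defaults afterwards; TRACE fills disks quickly:
gluster volume reset gv0 diagnostics.brick-log-level
gluster volume reset gv0 diagnostics.client-log-level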
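
Likewise, for the growing split-brain count and the "has not responded in the last 42 seconds" messages discussed in the thread, a quick status check could look like the sketch below (again assuming volume gv0; the volume get command needs glusterfs 3.7 or newer):

# Files still pending heal and files actually in split-brain:
gluster volume heal gv0 info
gluster volume heal gv0 info split-brain

# The 42-second figure is the default of network.ping-timeout; show the
# value currently in effect for the volume:
gluster volume get gv0 network.ping-timeout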
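
As for the "purge" question at the end of the thread: apt-get purge removes the packages and their conffiles, but volume and peer state can survive in the glusterd working directory. A rough sketch of the extra cleanup, assuming the default Debian/Ubuntu paths and the service name shipped by the Ubuntu packages (double-check before deleting anything, as this destroys all volume definitions):

umount /shared_data /shared_slurm        # unmount the gluster volumes first
service glusterfs-server stop            # stop the gluster daemons
apt-get purge glusterfs-server glusterfs-client

# State that may be left behind: /var/lib/glusterd holds the volume, peer
# and op-version information, /var/log/glusterfs only the old logs.
rm -rf /var/lib/glusterd /var/log/glusterfs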