Micha Ober
2017-Mar-08 18:35 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Just to let you know: I have reverted back to glusterfs 3.4.2 and everything is working again. No more disconnects, no more errors in the kernel log. So there *has* to be some kind of regression in the newer versions. Sadly, I guess, it will be hard to find.

2016-12-20 13:31 GMT+01:00 Micha Ober <micha2k at gmail.com>:

> Hi Rafi,
>
> here are the log files:
>
> NFS: http://paste.ubuntu.com/23658653/
> Brick: http://paste.ubuntu.com/23658656/
>
> The brick log is of the brick which has caused the last disconnect at 2016-12-20 06:46:36 (0-gv0-client-7).
>
> For completeness, here is also the dmesg output: http://paste.ubuntu.com/23658691/
>
> Regards,
> Micha
>
> 2016-12-19 7:28 GMT+01:00 Mohammed Rafi K C <rkavunga at redhat.com>:
>
>> Hi Micha,
>>
>> Sorry for the late reply. I was busy with some other things.
>>
>> If you still have the setup available, can you enable TRACE log level [1],[2] and see if you can find any log entries when the network starts disconnecting? Basically I'm trying to find out whether any disconnection occurred other than the ping timer expiry issue.
>>
>> [1] : gluster volume set <volname> diagnostics.brick-log-level TRACE
>>
>> [2] : gluster volume set <volname> diagnostics.client-log-level TRACE
>>
>> Regards
>>
>> Rafi KC
>>
>> On 12/08/2016 07:59 PM, Atin Mukherjee wrote:
>>
>> On Thu, Dec 8, 2016 at 4:37 PM, Micha Ober <micha2k at gmail.com> wrote:
>>
>>> Hi Rafi,
>>>
>>> thank you for your support. It is greatly appreciated.
>>>
>>> Just some more thoughts from my side:
>>>
>>> There have been no reports from other users in *this* thread until now, but I have found at least one user with a very similar problem in an older thread:
>>>
>>> https://www.gluster.org/pipermail/gluster-users/2014-November/019637.html
>>>
>>> He is also reporting disconnects with no apparent reason, although his setup is a bit more complicated, also involving a firewall. In our setup, all servers/clients are connected via 1 GbE with no firewall or anything that might block/throttle traffic. Also, we are using exactly the same software versions on all nodes.
>>>
>>> I can also find some reports in the bug tracker when searching for "rpc_client_ping_timer_expired" and "rpc_clnt_ping_timer_expired" (it looks like the spelling changed between versions).
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1096729
>>>
>> Just FYI, this is a different issue: here GlusterD fails to handle the volume of incoming requests on time, since MT-epoll is not enabled.
>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>
>>> But both reports involve large traffic/load on the bricks/disks, which is not the case for our setup. To give a ballpark figure: over three days, 30 GiB were written. And the data was not written at once, but continuously over the whole time.
>>>
>>> Just to be sure, I have checked the logfiles of one of the other clusters right now, which are sitting in the same building, in the same rack, even on the same switch, running the same jobs, but with glusterfs 3.4.2, and I can see no disconnects in the logfiles. So I can definitely rule out our infrastructure as the problem.
>>>
>>> Regards,
>>> Micha
>>>
>>> On 07.12.2016 at 18:08, Mohammed Rafi K C wrote:
>>>
>>> Hi Micha,
>>>
>>> This is great. I will provide you a debug build which has two fixes which I suspect as possible causes of the frequent disconnect issue, though I don't have much data to validate my theory.
>>> So I will take one more day to dig into that.
>>>
>>> Thanks for your support, and opensource++
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>> On 12/07/2016 05:02 AM, Micha Ober wrote:
>>>
>>> Hi,
>>>
>>> thank you for your answer and even more for the question!
>>> Until now, I was using FUSE. Today I changed all mounts to NFS using the same 3.7.17 version.
>>>
>>> But: The problem is still the same. Now, the NFS logfile contains lines like these:
>>>
>>> [2016-12-06 15:12:29.006325] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: server X.X.18.62:49153 has not responded in the last 42 seconds, disconnecting.
>>>
>>> Interestingly enough, the IP address X.X.18.62 is the same machine! As I wrote earlier, each node serves both as a server and a client, as each node contributes bricks to the volume. Every server is connecting to itself via its hostname. For example, the fstab on the node "giant2" looks like:
>>>
>>> #giant2:/gv0 /shared_data glusterfs defaults,noauto 0 0
>>> #giant2:/gv2 /shared_slurm glusterfs defaults,noauto 0 0
>>>
>>> giant2:/gv0 /shared_data nfs defaults,_netdev,vers=3 0 0
>>> giant2:/gv2 /shared_slurm nfs defaults,_netdev,vers=3 0 0
>>>
>>> So I understand the disconnects even less.
>>>
>>> I don't know if it's possible to create a dummy cluster which exposes the same behaviour, because the disconnects only happen when there are compute jobs running on those nodes - and they are GPU compute jobs, so that's something which cannot be easily emulated in a VM.
>>>
>>> As we have more clusters (which are running fine with an ancient 3.4 version :-)) and we are currently not dependent on this particular cluster (which may stay like this for this month, I think), I should be able to deploy the debug build on the "real" cluster, if you can provide one.
>>>
>>> Regards and thanks,
>>> Micha
>>>
>>> On 06.12.2016 at 08:15, Mohammed Rafi K C wrote:
>>>
>>> On 12/03/2016 12:56 AM, Micha Ober wrote:
>>>
>>> ** Update: ** I have downgraded from 3.8.6 to 3.7.17 now, but the problem still exists.
>>>
>>> Client log: http://paste.ubuntu.com/23569065/
>>> Brick log: http://paste.ubuntu.com/23569067/
>>>
>>> Please note that each server has two bricks. Yet, according to the logs, one brick loses the connection to all other hosts:
>>>
>>> [2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe)
>>> [2016-12-02 18:38:53.703381] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.62:49118 failed (Broken pipe)
>>> [2016-12-02 18:38:53.703380] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.107:49121 failed (Broken pipe)
>>> [2016-12-02 18:38:53.703424] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.206:49120 failed (Broken pipe)
>>> [2016-12-02 18:38:53.703359] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.58:49121 failed (Broken pipe)
>>>
>>> The SECOND brick on the SAME host is NOT affected, i.e. no disconnects!
>>> As I said, the network connection is fine and the disks are idle.
>>> The CPU always has 2 free cores.
>>>
>>> It looks like I have to downgrade to 3.4 now in order for the disconnects to stop.
>>>
>>> Hi Micha,
>>>
>>> Thanks for the update, and sorry for what happened with the higher gluster versions.
>>> I can understand the need to downgrade, as it is a production setup.
>>>
>>> Can you tell me which clients are used here? Whether it is fuse, nfs, nfs-ganesha, smb or libgfapi?
>>>
>>> Since I'm not able to reproduce the issue (I have been trying for the last 3 days) and the logs are not very helpful here (we don't have many logs in the socket layer), could you please create a dummy cluster and try to reproduce the issue? Then we can play with that volume and I could provide some debug build which we can use for further debugging.
>>>
>>> If you don't have bandwidth for this, please leave it ;).
>>>
>>> Regards
>>> Rafi KC
>>>
>>> - Micha
>>>
>>> On 30.11.2016 at 06:57, Mohammed Rafi K C wrote:
>>>
>>> Hi Micha,
>>>
>>> I have changed the thread and subject so that your original thread remains as it was for your query. Let's try to fix the problem you observed with 3.8.4, so I have started a new thread to discuss the frequent disconnect problem.
>>>
>>> *If anyone else has experienced the same problem, please respond to the mail.*
>>>
>>> It would be very helpful if you could give us some more logs from clients and bricks. Also, any reproducible steps will surely help to chase the problem further.
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>> On 11/30/2016 04:44 AM, Micha Ober wrote:
>>>
>>> I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain").
>>>
>>> The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
>>>
>>> Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
>>>
>>> I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems; the gigabit link between the nodes is not saturated at all. The disks are almost idle (<10%).
>>>
>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
>>>
>>> After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
>>>
>>> There also seems to be a bug report with a similar problem (but no progress):
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>
>>> For me, ALL servers are affected (not isolated to one or two servers).
>>>
>>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for more than 120 seconds." in the syslog.
>>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>>
>>> [root at giant2: ~]# gluster v info
>>>
>>> Volume Name: gv0
>>> Type: Distributed-Replicate
>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 6 x 2 = 12
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: giant1:/gluster/sdc/gv0
>>> Brick2: giant2:/gluster/sdc/gv0
>>> Brick3: giant3:/gluster/sdc/gv0
>>> Brick4: giant4:/gluster/sdc/gv0
>>> Brick5: giant5:/gluster/sdc/gv0
>>> Brick6: giant6:/gluster/sdc/gv0
>>> Brick7: giant1:/gluster/sdd/gv0
>>> Brick8: giant2:/gluster/sdd/gv0
>>> Brick9: giant3:/gluster/sdd/gv0
>>> Brick10: giant4:/gluster/sdd/gv0
>>> Brick11: giant5:/gluster/sdd/gv0
>>> Brick12: giant6:/gluster/sdd/gv0
>>> Options Reconfigured:
>>> auth.allow: X.X.X.*,127.0.0.1
>>> nfs.disable: on
>>>
>>> Volume Name: gv2
>>> Type: Replicate
>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: giant1:/gluster/sdd/gv2
>>> Brick2: giant2:/gluster/sdd/gv2
>>> Options Reconfigured:
>>> auth.allow: X.X.X.*,127.0.0.1
>>> cluster.granular-entry-heal: on
>>> cluster.locking-scheme: granular
>>> nfs.disable: on
>>>
>>> 2016-11-30 0:10 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>>
>>>> There also seems to be a bug report with a similar problem (but no progress):
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>>
>>>> For me, ALL servers are affected (not isolated to one or two servers).
>>>>
>>>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for more than 120 seconds." in the syslog.
>>>>
>>>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>>>
>>>> [root at giant2: ~]# gluster v info
>>>>
>>>> Volume Name: gv0
>>>> Type: Distributed-Replicate
>>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 6 x 2 = 12
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdc/gv0
>>>> Brick2: giant2:/gluster/sdc/gv0
>>>> Brick3: giant3:/gluster/sdc/gv0
>>>> Brick4: giant4:/gluster/sdc/gv0
>>>> Brick5: giant5:/gluster/sdc/gv0
>>>> Brick6: giant6:/gluster/sdc/gv0
>>>> Brick7: giant1:/gluster/sdd/gv0
>>>> Brick8: giant2:/gluster/sdd/gv0
>>>> Brick9: giant3:/gluster/sdd/gv0
>>>> Brick10: giant4:/gluster/sdd/gv0
>>>> Brick11: giant5:/gluster/sdd/gv0
>>>> Brick12: giant6:/gluster/sdd/gv0
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> nfs.disable: on
>>>>
>>>> Volume Name: gv2
>>>> Type: Replicate
>>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdd/gv2
>>>> Brick2: giant2:/gluster/sdd/gv2
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> cluster.granular-entry-heal: on
>>>> cluster.locking-scheme: granular
>>>> nfs.disable: on
>>>>
>>>> 2016-11-29 19:21 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>>>
>>>>> I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain").
>>>>> The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
>>>>>
>>>>> Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
>>>>>
>>>>> I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems; the gigabit link between the nodes is not saturated at all. The disks are almost idle (<10%).
>>>>>
>>>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
>>>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
>>>>>
>>>>> After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
>>>>>
>>>>> 2016-11-29 18:53 GMT+01:00 Atin Mukherjee <amukherj at redhat.com>:
>>>>>
>>>>>> Would you be able to share what is not working for you in 3.8.x (mention the exact version)? 3.4 is quite old, and falling back to an unsupported version doesn't look like a feasible option.
>>>>>>
>>>>>> On Tue, 29 Nov 2016 at 17:01, Micha Ober <micha2k at gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was using gluster 3.4 and upgraded to 3.8, but that version turned out to be unusable for me. I now need to downgrade.
>>>>>>>
>>>>>>> I'm running Ubuntu 14.04. As upgrades of the op version are irreversible, I guess I have to delete all gluster volumes and re-create them with the downgraded version:
>>>>>>>
>>>>>>> 0. Backup data
>>>>>>> 1. Unmount all gluster volumes
>>>>>>> 2. apt-get purge glusterfs-server glusterfs-client
>>>>>>> 3. Remove PPA for 3.8
>>>>>>> 4. Add PPA for older version
>>>>>>> 5. apt-get install glusterfs-server glusterfs-client
>>>>>>> 6. Create volumes
>>>>>>>
>>>>>>> Is "purge" enough to delete all configuration files of the currently installed version, or do I need to manually clear some residues before installing an older version?
>>>>>>>
>>>>>>> Thanks.
>>>>>>
>>>>>> --
>>>>>> - Atin (atinm)
>>
>> --
>> ~ Atin (atinm)
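A minimal sketch of the TRACE-logging steps Rafi suggests above, for anyone wanting to reproduce the debugging. The volume name gv0 is taken from this thread; the log locations are the default Ubuntu paths under /var/log/glusterfs and may differ on other installations.

# Enable very verbose TRACE logging on bricks and clients while the disconnects are reproduced.
gluster volume set gv0 diagnostics.brick-log-level TRACE
gluster volume set gv0 diagnostics.client-log-level TRACE

# Look for ping-timer expiries and other disconnect messages around the time of a failure.
# The pattern matches both spellings ("rpc_client_..." and "rpc_clnt_...") mentioned earlier in the thread.
grep -iE "ping_timer_expired|disconnect" /var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log

# Restore the default log levels once the logs have been collected; TRACE grows the logs quickly.
gluster volume reset gv0 diagnostics.brick-log-level
gluster volume reset gv0 diagnostics.client-log-level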
Amar Tumballi
2017-Mar-09 08:23 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
----- Original Message -----
> From: "Micha Ober" <micha2k at gmail.com>
>
> Just to let you know: I have reverted back to glusterfs 3.4.2 and everything is working again. No more disconnects, no more errors in the kernel log. So there *has* to be some kind of regression in the newer versions. Sadly, I guess, it will be hard to find.

Thanks for the update Micha. This helps to corner the issue a little, at least.

Regards,
Amar
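The "has not responded in the last 42 seconds" messages quoted throughout this thread correspond to GlusterFS's RPC ping timer, which is controlled by the network.ping-timeout volume option (42 seconds by default). A small sketch for inspecting it, again assuming the volume name gv0 from this thread and a release new enough to support "gluster volume get" (3.7 and later):

# Show the effective ping timeout for the volume (default: 42 seconds).
gluster volume get gv0 network.ping-timeout

# Raising the timeout only masks the symptom, but it can help confirm whether the
# disconnects really are ping-timer expiries rather than TCP-level resets.
# gluster volume set gv0 network.ping-timeout 60

# Check whether the disconnects have left files in split-brain, as reported earlier in the thread.
gluster volume heal gv0 info split-brain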
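On the downgrade question raised at the bottom of the quoted thread (whether "purge" removes all configuration): apt only manages the packaged files, while glusterd keeps its volume and peer state under /var/lib/glusterd and its configuration under /etc/glusterfs, which may survive a purge depending on the packaging. A hedged sketch of a clean removal on Ubuntu follows; the PPA names are assumptions, not taken from the thread, and removing /var/lib/glusterd destroys all volume definitions, so it should only be done after the data has been backed up.

# Unmount the gluster volumes first (mount points taken from the fstab quoted above).
umount /shared_data /shared_slurm

# Remove the packages together with their configuration files.
apt-get purge glusterfs-server glusterfs-client glusterfs-common

# glusterd state may survive a purge; remove it only if the volumes are to be recreated from scratch.
rm -rf /var/lib/glusterd /etc/glusterfs

# Swap the repositories (PPA names here are assumptions) and reinstall the older series.
add-apt-repository --remove ppa:gluster/glusterfs-3.8
add-apt-repository ppa:gluster/glusterfs-3.7
apt-get update && apt-get install glusterfs-server glusterfs-client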