Micha Ober
2017-Mar-08 18:35 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
Just to let you know: I have reverted back to glusterfs 3.4.2 and everything is working again. No more disconnects, no more errors in the kernel log. So there *has* to be some kind of regression in the newer versions. Sadly, I guess, it will be hard to find.

2016-12-20 13:31 GMT+01:00 Micha Ober <micha2k at gmail.com>:

> Hi Rafi,
>
> here are the log files:
>
> NFS: http://paste.ubuntu.com/23658653/
> Brick: http://paste.ubuntu.com/23658656/
>
> The brick log is of the brick which has caused the last disconnect at 2016-12-20 06:46:36 (0-gv0-client-7).
>
> For completeness, here is also the dmesg output: http://paste.ubuntu.com/23658691/
>
> Regards,
> Micha
>
> 2016-12-19 7:28 GMT+01:00 Mohammed Rafi K C <rkavunga at redhat.com>:
>
>> Hi Micha,
>>
>> Sorry for the late reply. I was busy with some other things.
>>
>> If you still have the setup available, can you enable TRACE log level [1],[2] and see if you can find any log entries when the network starts disconnecting? Basically I'm trying to find out whether any disconnection occurred other than the ping timer expiry issue.
>>
>> [1] : gluster volume set <volname> diagnostics.brick-log-level TRACE
>>
>> [2] : gluster volume set <volname> diagnostics.client-log-level TRACE
>>
>> Regards
>>
>> Rafi KC
>>
>> On 12/08/2016 07:59 PM, Atin Mukherjee wrote:
>>
>> On Thu, Dec 8, 2016 at 4:37 PM, Micha Ober <micha2k at gmail.com> wrote:
>>
>>> Hi Rafi,
>>>
>>> thank you for your support. It is greatly appreciated.
>>>
>>> Just some more thoughts from my side:
>>>
>>> There have been no reports from other users in *this* thread until now, but I have found at least one user with a very similar problem in an older thread:
>>>
>>> https://www.gluster.org/pipermail/gluster-users/2014-November/019637.html
>>>
>>> He is also reporting disconnects with no apparent reason, although his setup is a bit more complicated, also involving a firewall. In our setup, all servers/clients are connected via 1 GbE with no firewall or anything that might block/throttle traffic. Also, we are using exactly the same software versions on all nodes.
>>>
>>> I can also find some reports in the bug tracker when searching for "rpc_client_ping_timer_expired" and "rpc_clnt_ping_timer_expired" (it looks like the spelling changed between versions).
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1096729
>>>
>> Just FYI, this is a different issue: here GlusterD fails to handle the volume of incoming requests on time, since MT-epoll is not enabled.
>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>
>>> But both reports involve large traffic/load on the bricks/disks, which is not the case for our setup. To give a ballpark figure: over three days, 30 GiB were written. And the data was not written at once, but continuously over the whole time.
>>>
>>> Just to be sure, I have checked the logfiles of one of the other clusters right now, which are sitting in the same building, in the same rack, even on the same switch, running the same jobs, but with glusterfs 3.4.2, and I can see no disconnects in the logfiles. So I can definitely rule out our infrastructure as the problem.
>>>
>>> Regards,
>>> Micha
>>>
>>> On 07.12.2016 at 18:08, Mohammed Rafi K C wrote:
>>>
>>> Hi Micha,
>>>
>>> This is great. I will provide you a debug build which has two fixes which I suspect as possible causes of the frequent disconnect issue, though I don't have much data to validate my theory.
>>> So I will take one more day to dig into that.
>>>
>>> Thanks for your support, and opensource++
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>> On 12/07/2016 05:02 AM, Micha Ober wrote:
>>>
>>> Hi,
>>>
>>> thank you for your answer and even more for the question!
>>> Until now, I was using FUSE. Today I changed all mounts to NFS using the same 3.7.17 version.
>>>
>>> But: The problem is still the same. Now, the NFS logfile contains lines like these:
>>>
>>> [2016-12-06 15:12:29.006325] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: server X.X.18.62:49153 has not responded in the last 42 seconds, disconnecting.
>>>
>>> Interestingly enough, the IP address X.X.18.62 is the same machine! As I wrote earlier, each node serves both as a server and a client, as each node contributes bricks to the volume. Every server is connecting to itself via its hostname. For example, the fstab on the node "giant2" looks like:
>>>
>>> #giant2:/gv0 /shared_data glusterfs defaults,noauto 0 0
>>> #giant2:/gv2 /shared_slurm glusterfs defaults,noauto 0 0
>>>
>>> giant2:/gv0 /shared_data nfs defaults,_netdev,vers=3 0 0
>>> giant2:/gv2 /shared_slurm nfs defaults,_netdev,vers=3 0 0
>>>
>>> So I understand the disconnects even less.
>>>
>>> I don't know if it's possible to create a dummy cluster which exposes the same behaviour, because the disconnects only happen when there are compute jobs running on those nodes - and they are GPU compute jobs, so that's something which cannot be easily emulated in a VM.
>>>
>>> As we have more clusters (which are running fine with an ancient 3.4 version :-)) and we are currently not dependent on this particular cluster (which may stay like this for this month, I think), I should be able to deploy the debug build on the "real" cluster, if you can provide one.
>>>
>>> Regards and thanks,
>>> Micha
>>>
>>> On 06.12.2016 at 08:15, Mohammed Rafi K C wrote:
>>>
>>> On 12/03/2016 12:56 AM, Micha Ober wrote:
>>>
>>> ** Update: ** I have downgraded from 3.8.6 to 3.7.17 now, but the problem still exists.
>>>
>>> Client log: http://paste.ubuntu.com/23569065/
>>> Brick log: http://paste.ubuntu.com/23569067/
>>>
>>> Please note that each server has two bricks. Yet, according to the logs, one brick loses the connection to all other hosts:
>>>
>>> [2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe)
>>> [2016-12-02 18:38:53.703381] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.62:49118 failed (Broken pipe)
>>> [2016-12-02 18:38:53.703380] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.107:49121 failed (Broken pipe)
>>> [2016-12-02 18:38:53.703424] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.206:49120 failed (Broken pipe)
>>> [2016-12-02 18:38:53.703359] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.58:49121 failed (Broken pipe)
>>>
>>> The SECOND brick on the SAME host is NOT affected, i.e. no disconnects!
>>> As I said, the network connection is fine and the disks are idle.
>>> The CPU always has 2 free cores.
>>>
>>> It looks like I have to downgrade to 3.4 now in order for the disconnects to stop.
>>>
>>> Hi Micha,
>>>
>>> Thanks for the update, and sorry for what happened with the higher gluster versions.
>>> I can understand the need to downgrade, as it is a production setup.
>>>
>>> Can you tell me which clients are used here? Whether it is fuse, nfs, nfs-ganesha, smb or libgfapi?
>>>
>>> Since I'm not able to reproduce the issue (I have been trying for the last 3 days) and the logs are not very helpful here (we don't have many logs in the socket layer), could you please create a dummy cluster and try to reproduce the issue? Then we can play with that volume and I could provide some debug build which we can use for further debugging.
>>>
>>> If you don't have bandwidth for this, please leave it ;).
>>>
>>> Regards
>>> Rafi KC
>>>
>>> - Micha
>>>
>>> On 30.11.2016 at 06:57, Mohammed Rafi K C wrote:
>>>
>>> Hi Micha,
>>>
>>> I have changed the thread and subject so that your original thread remains as it was for your query. Let's try to fix the problem you observed with 3.8.4, so I have started a new thread to discuss the frequent disconnect problem.
>>>
>>> *If anyone else has experienced the same problem, please respond to the mail.*
>>>
>>> It would be very helpful if you could give us some more logs from clients and bricks. Also, any reproducible steps will surely help to chase the problem further.
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>> On 11/30/2016 04:44 AM, Micha Ober wrote:
>>>
>>> I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain").
>>>
>>> The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
>>>
>>> Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
>>>
>>> I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems; the gigabit link between the nodes is not saturated at all. The disks are almost idle (<10%).
>>>
>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
>>>
>>> After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
>>>
>>> There also seems to be a bug report with a similar problem (but no progress):
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>
>>> For me, ALL servers are affected (not isolated to one or two servers).
>>>
>>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for more than 120 seconds." in the syslog.
>>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>>
>>> [root at giant2: ~]# gluster v info
>>>
>>> Volume Name: gv0
>>> Type: Distributed-Replicate
>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 6 x 2 = 12
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: giant1:/gluster/sdc/gv0
>>> Brick2: giant2:/gluster/sdc/gv0
>>> Brick3: giant3:/gluster/sdc/gv0
>>> Brick4: giant4:/gluster/sdc/gv0
>>> Brick5: giant5:/gluster/sdc/gv0
>>> Brick6: giant6:/gluster/sdc/gv0
>>> Brick7: giant1:/gluster/sdd/gv0
>>> Brick8: giant2:/gluster/sdd/gv0
>>> Brick9: giant3:/gluster/sdd/gv0
>>> Brick10: giant4:/gluster/sdd/gv0
>>> Brick11: giant5:/gluster/sdd/gv0
>>> Brick12: giant6:/gluster/sdd/gv0
>>> Options Reconfigured:
>>> auth.allow: X.X.X.*,127.0.0.1
>>> nfs.disable: on
>>>
>>> Volume Name: gv2
>>> Type: Replicate
>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: giant1:/gluster/sdd/gv2
>>> Brick2: giant2:/gluster/sdd/gv2
>>> Options Reconfigured:
>>> auth.allow: X.X.X.*,127.0.0.1
>>> cluster.granular-entry-heal: on
>>> cluster.locking-scheme: granular
>>> nfs.disable: on
>>>
>>> 2016-11-30 0:10 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>>
>>>> There also seems to be a bug report with a similar problem (but no progress):
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>>
>>>> For me, ALL servers are affected (not isolated to one or two servers).
>>>>
>>>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for more than 120 seconds." in the syslog.
>>>>
>>>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>>>
>>>> [root at giant2: ~]# gluster v info
>>>>
>>>> Volume Name: gv0
>>>> Type: Distributed-Replicate
>>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 6 x 2 = 12
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdc/gv0
>>>> Brick2: giant2:/gluster/sdc/gv0
>>>> Brick3: giant3:/gluster/sdc/gv0
>>>> Brick4: giant4:/gluster/sdc/gv0
>>>> Brick5: giant5:/gluster/sdc/gv0
>>>> Brick6: giant6:/gluster/sdc/gv0
>>>> Brick7: giant1:/gluster/sdd/gv0
>>>> Brick8: giant2:/gluster/sdd/gv0
>>>> Brick9: giant3:/gluster/sdd/gv0
>>>> Brick10: giant4:/gluster/sdd/gv0
>>>> Brick11: giant5:/gluster/sdd/gv0
>>>> Brick12: giant6:/gluster/sdd/gv0
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> nfs.disable: on
>>>>
>>>> Volume Name: gv2
>>>> Type: Replicate
>>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: giant1:/gluster/sdd/gv2
>>>> Brick2: giant2:/gluster/sdd/gv2
>>>> Options Reconfigured:
>>>> auth.allow: X.X.X.*,127.0.0.1
>>>> cluster.granular-entry-heal: on
>>>> cluster.locking-scheme: granular
>>>> nfs.disable: on
>>>>
>>>> 2016-11-29 19:21 GMT+01:00 Micha Ober <micha2k at gmail.com>:
>>>>
>>>>> I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain").
>>>>> The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
>>>>>
>>>>> Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
>>>>>
>>>>> I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems; the gigabit link between the nodes is not saturated at all. The disks are almost idle (<10%).
>>>>>
>>>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
>>>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
>>>>>
>>>>> After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
>>>>>
>>>>> 2016-11-29 18:53 GMT+01:00 Atin Mukherjee <amukherj at redhat.com>:
>>>>>
>>>>>> Would you be able to share what is not working for you in 3.8.x (mention the exact version)? 3.4 is quite old, and falling back to an unsupported version doesn't look like a feasible option.
>>>>>>
>>>>>> On Tue, 29 Nov 2016 at 17:01, Micha Ober <micha2k at gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was using gluster 3.4 and upgraded to 3.8, but that version turned out to be unusable for me. I now need to downgrade.
>>>>>>>
>>>>>>> I'm running Ubuntu 14.04. As upgrades of the op version are irreversible, I guess I have to delete all gluster volumes and re-create them with the downgraded version:
>>>>>>>
>>>>>>> 0. Backup data
>>>>>>> 1. Unmount all gluster volumes
>>>>>>> 2. apt-get purge glusterfs-server glusterfs-client
>>>>>>> 3. Remove PPA for 3.8
>>>>>>> 4. Add PPA for older version
>>>>>>> 5. apt-get install glusterfs-server glusterfs-client
>>>>>>> 6. Create volumes
>>>>>>>
>>>>>>> Is "purge" enough to delete all configuration files of the currently installed version, or do I need to manually clear some residues before installing an older version?
>>>>>>>
>>>>>>> Thanks.
>>>>>>
>>>>>> --
>>>>>> - Atin (atinm)
>>
>> --
>> ~ Atin (atinm)
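A minimal sketch of the TRACE-logging steps Rafi suggests above, for anyone wanting to reproduce the debugging. The volume name gv0 is taken from this thread; the log locations are the default Ubuntu paths under /var/log/glusterfs and may differ on other installations.

# Enable very verbose TRACE logging on bricks and clients while the disconnects are reproduced.
gluster volume set gv0 diagnostics.brick-log-level TRACE
gluster volume set gv0 diagnostics.client-log-level TRACE

# Look for ping-timer expiries and other disconnect messages around the time of a failure.
# The pattern matches both spellings ("rpc_client_..." and "rpc_clnt_...") mentioned earlier in the thread.
grep -iE "ping_timer_expired|disconnect" /var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log

# Restore the default log levels once the logs have been collected; TRACE grows the logs quickly.
gluster volume reset gv0 diagnostics.brick-log-level
gluster volume reset gv0 diagnostics.client-log-level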
Amar Tumballi
2017-Mar-09 08:23 UTC
[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs
----- Original Message -----
> From: "Micha Ober" <micha2k at gmail.com>
>
> Just to let you know: I have reverted back to glusterfs 3.4.2 and everything is working again. No more disconnects, no more errors in the kernel log. So there *has* to be some kind of regression in the newer versions. Sadly, I guess, it will be hard to find.

Thanks for the update Micha. This helps to corner the issue a little, at least.

Regards,
Amar
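The "has not responded in the last 42 seconds" messages quoted throughout this thread correspond to GlusterFS's RPC ping timer, which is controlled by the network.ping-timeout volume option (42 seconds by default). A small sketch for inspecting it, again assuming the volume name gv0 from this thread and a release new enough to support "gluster volume get" (3.7 and later):

# Show the effective ping timeout for the volume (default: 42 seconds).
gluster volume get gv0 network.ping-timeout

# Raising the timeout only masks the symptom, but it can help confirm whether the
# disconnects really are ping-timer expiries rather than TCP-level resets.
# gluster volume set gv0 network.ping-timeout 60

# Check whether the disconnects have left files in split-brain, as reported earlier in the thread.
gluster volume heal gv0 info split-brain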
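On the downgrade question raised at the bottom of the quoted thread (whether "purge" removes all configuration): apt only manages the packaged files, while glusterd keeps its volume and peer state under /var/lib/glusterd and its configuration under /etc/glusterfs, which may survive a purge depending on the packaging. A hedged sketch of a clean removal on Ubuntu follows; the PPA names are assumptions, not taken from the thread, and removing /var/lib/glusterd destroys all volume definitions, so it should only be done after the data has been backed up.

# Unmount the gluster volumes first (mount points taken from the fstab quoted above).
umount /shared_data /shared_slurm

# Remove the packages together with their configuration files.
apt-get purge glusterfs-server glusterfs-client glusterfs-common

# glusterd state may survive a purge; remove it only if the volumes are to be recreated from scratch.
rm -rf /var/lib/glusterd /etc/glusterfs

# Swap the repositories (PPA names here are assumptions) and reinstall the older series.
add-apt-repository --remove ppa:gluster/glusterfs-3.8
add-apt-repository ppa:gluster/glusterfs-3.7
apt-get update && apt-get install glusterfs-server glusterfs-client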