thr3ads.net - Gluster users - [Gluster-users] RE : Frequent connect and disconnect messages flooded in logs [Dec 2016]


Atin Mukherjee

2016-Dec-08 14:29 UTC

[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs

On Thu, Dec 8, 2016 at 4:37 PM, Micha Ober <micha2k at gmail.com> wrote:
> Hi Rafi,
>
> thank you for your support. It is greatly appreciated.
>
> Just some more thoughts from my side:
>
> There have been no reports from other users in *this* thread until now,
> but I have found at least one user with a very similar problem in an older
> thread:
>
> https://www.gluster.org/pipermail/gluster-users/2014-November/019637.html
>
> He is also reporting disconnects with no apparent reason, although his
> setup is a bit more complicated, also involving a firewall. In our setup,
> all servers/clients are connected via 1 GbE with no firewall or anything
> that might block/throttle traffic. Also, we are using exactly the same
> software versions on all nodes.
>
>
> I can also find some reports in the bugtracker when searching for
> "rpc_client_ping_timer_expired" and "rpc_clnt_ping_timer_expired" (looks
> like the spelling changed between versions).
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1096729
>
Just FYI, this is a different issue: there, GlusterD fails to handle the
volume of incoming requests in time because MT-epoll is not enabled.

>
> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>
> But both reports involve large traffic/load on the bricks/disks, which is
> not the case for our setup.
> To give a ballpark figure: Over three days, 30 GiB were written. And the
> data was not written at once, but continuously over the whole time.
>
>
> Just to be sure, I have just checked the logfiles of one of the other
> clusters, which sits in the same building, in the same rack, even on the
> same switch, running the same jobs, but with glusterfs 3.4.2, and I can see
> no disconnects in the logfiles. So I can definitely rule out our
> infrastructure as the problem.
>
> Regards,
> Micha
>
>
>
> On 07.12.2016 at 18:08, Mohammed Rafi K C wrote:
>
> Hi Micha,
>
> This is great. I will provide you with a debug build containing two fixes
> that I suspect may address the frequent disconnect issue, though I don't
> have much data to validate my theory. So I will take one more day to dig
> into that.
>
> Thanks for your support, and opensource++
>
> Regards
>
> Rafi KC
> On 12/07/2016 05:02 AM, Micha Ober wrote:
>
> Hi,
>
> thank you for your answer and even more for the question!
> Until now, I was using FUSE. Today I changed all mounts to NFS using the
> same 3.7.17 version.
>
> But: The problem is still the same. Now, the NFS logfile contains lines
> like these:
>
> [2016-12-06 15:12:29.006325] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: server X.X.18.62:49153 has not responded in the last 42 seconds, disconnecting.
>
> Interestingly enough, the IP address X.X.18.62 is the same machine! As I
> wrote earlier, each node serves both as a server and a client, as each node
> contributes bricks to the volume. Every server is connecting to itself via
> its hostname. For example, the fstab on the node "giant2" looks like:
>
> #giant2:/gv0    /shared_data    glusterfs       defaults,noauto 0       0
> #giant2:/gv2    /shared_slurm   glusterfs       defaults,noauto 0       0
>
> giant2:/gv0     /shared_data    nfs             defaults,_netdev,vers=3 0       0
> giant2:/gv2     /shared_slurm   nfs             defaults,_netdev,vers=3 0       0
>
> So I understand the disconnects even less.
>
> I don't know if it's possible to create a dummy cluster which exposes the
> same behaviour, because the disconnects only happen when there are compute
> jobs running on those nodes - and they are GPU compute jobs, so that's
> something which cannot be easily emulated in a VM.
>
> As we have more clusters (which are running fine with an ancient 3.4
> version :-)) and we are currently not dependent on this particular cluster
> (which may stay like this for this month, I think) I should be able to
> deploy the debug build on the "real" cluster, if you can provide a debug
> build.
>
> Regards and thanks,
> Micha
>
>
>
> On 06.12.2016 at 08:15, Mohammed Rafi K C wrote:
>
>
>
> On 12/03/2016 12:56 AM, Micha Ober wrote:
>
> ** Update: ** I have downgraded from 3.8.6 to 3.7.17 now, but the problem
> still exists.
>
>
> Client log: http://paste.ubuntu.com/23569065/
> Brick log: http://paste.ubuntu.com/23569067/
>
> Please note that each server has two bricks.
> According to the logs, however, one brick loses the connection to all
> other hosts:
>
> [2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe)
> [2016-12-02 18:38:53.703381] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.62:49118 failed (Broken pipe)
> [2016-12-02 18:38:53.703380] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.107:49121 failed (Broken pipe)
> [2016-12-02 18:38:53.703424] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.206:49120 failed (Broken pipe)
> [2016-12-02 18:38:53.703359] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.58:49121 failed (Broken pipe)
>
> The SECOND brick on the SAME host is NOT affected, i.e. no disconnects!
> As I said, the network connection is fine and the disks are idle.
> The CPU always has 2 free cores.
>
> It looks like I have to downgrade to 3.4 now in order for the disconnects
> to stop.
>
>
> Hi Micha,
>
> Thanks for the update, and sorry for what happened with the newer gluster
> versions. I can understand the need to downgrade, as it is a production
> setup.
>
> Can you tell me which clients are used here? Is it FUSE, NFS, NFS-Ganesha,
> SMB or libgfapi?
>
> Since I'm not able to reproduce the issue (I have been trying for the last
> 3 days) and the logs are not much help here (we don't have much logging in
> the socket layer), could you please create a dummy cluster and try to
> reproduce the issue? If that works, we can play with that volume and I
> could provide a debug build which we can use for further debugging.
>
> If you don't have bandwidth for this, please leave it ;).
>
> Regards
> Rafi KC
>
> - Micha
>
>
> On 30.11.2016 at 06:57, Mohammed Rafi K C wrote:
>
> Hi Micha,
>
> I have changed the thread and subject so that your original thread stays
> dedicated to your original query. Let's try to fix the problem you observed
> with 3.8.4, so I have started a new thread to discuss the frequent
> disconnect problem.
>
> *If anyone else has experienced the same problem, please respond to this
> mail.*
>
> It would be very helpful if you could give us some more logs from clients
> and bricks. Also, any steps to reproduce will surely help to chase the
> problem further.
>
> Regards
>
> Rafi KC
> On 11/30/2016 04:44 AM, Micha Ober wrote:
>
> I had opened another thread on this mailing list (Subject: "After upgrade
> from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and
> split-brain").
>
> The title may be a bit misleading now, as I am no longer observing high
> CPU usage after upgrading to 3.8.6, but the disconnects are still happening
> and the number of files in split-brain is growing.
>
> Setup: 6 compute nodes, each serving as a glusterfs server and client,
> Ubuntu 14.04, two bricks per node, distribute-replicate
>
> I have two gluster volumes set up (one for scratch data, one for the slurm
> scheduler). Only the scratch data volume shows critical errors "[...] has
> not responded in the last 42 seconds, disconnecting.". So I can rule out
> network problems, the gigabit link between the nodes is not saturated at
> all. The disks are almost idle (<10%).
>
> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster,
> running fine since it was deployed.
> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for
> almost a year.
>
> After upgrading to 3.8.5, the problems (as described) started. I would
> like to use some of the new features of the newer versions (like bitrot),
> but the users can't run their compute jobs right now because the result
> files are garbled.
>
> There also seems to be a bug report with a similar problem (but no
> progress):
> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>
> For me, ALL servers are affected (not isolated to one or two servers)
>
> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked for
> more than 120 seconds." in the syslog.
>
> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>
> [root@giant2: ~]# gluster v info
>
> Volume Name: gv0
> Type: Distributed-Replicate
> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 6 x 2 = 12
> Transport-type: tcp
> Bricks:
> Brick1: giant1:/gluster/sdc/gv0
> Brick2: giant2:/gluster/sdc/gv0
> Brick3: giant3:/gluster/sdc/gv0
> Brick4: giant4:/gluster/sdc/gv0
> Brick5: giant5:/gluster/sdc/gv0
> Brick6: giant6:/gluster/sdc/gv0
> Brick7: giant1:/gluster/sdd/gv0
> Brick8: giant2:/gluster/sdd/gv0
> Brick9: giant3:/gluster/sdd/gv0
> Brick10: giant4:/gluster/sdd/gv0
> Brick11: giant5:/gluster/sdd/gv0
> Brick12: giant6:/gluster/sdd/gv0
> Options Reconfigured:
> auth.allow: X.X.X.*,127.0.0.1
> nfs.disable: on
>
> Volume Name: gv2
> Type: Replicate
> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: giant1:/gluster/sdd/gv2
> Brick2: giant2:/gluster/sdd/gv2
> Options Reconfigured:
> auth.allow: X.X.X.*,127.0.0.1
> cluster.granular-entry-heal: on
> cluster.locking-scheme: granular
> nfs.disable: on
>
>
>>>
>>> 2016-11-29 18:53 GMT+01:00 Atin Mukherjee <amukherj at redhat.com>:
>>>
>>>> Would you be able to share what is not working for you in 3.8.x
>>>> (mention the exact version)? 3.4 is quite old and falling back to an
>>>> unsupported version doesn't look like a feasible option.
>>>>
>>>> On Tue, 29 Nov 2016 at 17:01, Micha Ober <micha2k at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I was using gluster 3.4 and upgraded to 3.8, but that version proved
>>>>> to be unusable for me. I now need to downgrade.
>>>>>
>>>>> I'm running Ubuntu 14.04. As upgrades of the op version
>>>>> are irreversible, I guess I have to delete all gluster volumes and
>>>>> re-create them with the downgraded version.
>>>>>
>>>>> 0. Backup data
>>>>> 1. Unmount all gluster volumes
>>>>> 2. apt-get purge glusterfs-server glusterfs-client
>>>>> 3. Remove PPA for 3.8
>>>>> 4. Add PPA for older version
>>>>> 5. apt-get install glusterfs-server glusterfs-client
>>>>> 6. Create volumes
>>>>>
>>>>> Is "purge" enough to delete all configuration files of the currently
>>>>> installed version or do I need to manually clear some residues before
>>>>> installing an older version?
>>>>>
>>>>> Thanks.
>>>>
>>>> --
>>>> - Atin (atinm)
>>>>
>>>
>>>
>>
>
>

-- 

~ Atin (atinm)
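
To relate the translator name in the quoted log line (0-gv0-client-7) to a brick, a small Python sketch may help. It rests on two assumptions that are not confirmed anywhere in this thread: that client translators are numbered from zero in the brick order printed by "gluster v info", and that replica sets are formed from consecutive bricks in that order. The helper names are made up for illustration.

```python
# Brick order copied from the "gluster v info" output for gv0 in this thread.
GV0_BRICKS = [
    "giant1:/gluster/sdc/gv0",
    "giant2:/gluster/sdc/gv0",
    "giant3:/gluster/sdc/gv0",
    "giant4:/gluster/sdc/gv0",
    "giant5:/gluster/sdc/gv0",
    "giant6:/gluster/sdc/gv0",
    "giant1:/gluster/sdd/gv0",
    "giant2:/gluster/sdd/gv0",
    "giant3:/gluster/sdd/gv0",
    "giant4:/gluster/sdd/gv0",
    "giant5:/gluster/sdd/gv0",
    "giant6:/gluster/sdd/gv0",
]


def replica_sets(bricks, replica=2):
    """Group bricks into replica sets (assumption: consecutive bricks pair up)."""
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


def brick_for_client(bricks, client_name):
    """Map a name like '0-gv0-client-7' to a brick (assumption: zero-based index)."""
    index = int(client_name.rsplit("-", 1)[1])
    return bricks[index]


if __name__ == "__main__":
    print(brick_for_client(GV0_BRICKS, "0-gv0-client-7"))
    for number, pair in enumerate(replica_sets(GV0_BRICKS)):
        print(number, pair)
```

If both assumptions hold, client-7 points at a brick on giant2, which would match the observation that the unresponsive server is the local machine.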

Mohammed Rafi K C

2016-Dec-19 06:28 UTC


[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs

Hi Micha,

Sorry for the late reply. I was busy with some other things.

If you still have the setup available, can you enable the TRACE log level
[1],[2] and see if you can find any log entries from when the network starts
disconnecting? Basically, I'm trying to find out whether any disconnection
occurred other than the ping timer expiry issue.



[1] : gluster volume set <volname> diagnostics.brick-log-level TRACE

[2] : gluster volume set <volname> diagnostics.client-log-level TRACE
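
For convenience, the two settings can be generated for any volume together with matching commands that restore the defaults afterwards, since TRACE logging is very verbose. The following is only a small Python sketch: the option names come from the two lines above, "gluster volume set" and "gluster volume reset" are the stock CLI subcommands for changing and clearing a volume option, and the helper names are made up for illustration.

```python
import shlex

# Option names as given in the message; both are ordinary volume options.
LOG_OPTIONS = (
    "diagnostics.brick-log-level",
    "diagnostics.client-log-level",
)


def trace_commands(volname, level="TRACE"):
    """Return the argv lists that raise both log levels on one volume."""
    return [["gluster", "volume", "set", volname, opt, level] for opt in LOG_OPTIONS]


def reset_commands(volname):
    """Return the argv lists that put both options back to their defaults."""
    return [["gluster", "volume", "reset", volname, opt] for opt in LOG_OPTIONS]


if __name__ == "__main__":
    # gv0 is the scratch volume from this thread. The commands are only
    # printed here, not executed.
    for argv in trace_commands("gv0") + reset_commands("gv0"):
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run on a machine without gluster installed; the output can be reviewed and pasted into a root shell on one of the nodes.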


Regards

Rafi KC
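
With more verbose logging the brick and client logs grow quickly. As a rough aid for spotting when disconnects happen, the Python sketch below counts, per peer address, the two message types quoted earlier in this thread. The regular expressions are modelled only on those two quoted lines, so other gluster versions may word the messages differently; the function name and the sample lines are illustrative.

```python
import re
from collections import Counter

# Patterns modelled on the two log lines quoted in this thread.
PING_RE = re.compile(
    r"rpc_clnt_ping_timer_expired\]\s+\S+: server (?P<peer>\S+) has not responded"
)
PIPE_RE = re.compile(r"writev on (?P<peer>\S+) failed \(Broken pipe\)")


def count_disconnects(lines):
    """Count ping-timer expiries and broken-pipe writes per peer address."""
    counts = Counter()
    for line in lines:
        match = PING_RE.search(line)
        if match:
            counts[("ping-timeout", match.group("peer"))] += 1
            continue
        match = PIPE_RE.search(line)
        if match:
            counts[("broken-pipe", match.group("peer"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[2016-12-06 15:12:29.006325] C "
        "[rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: "
        "server X.X.18.62:49153 has not responded in the last 42 seconds, "
        "disconnecting.",
        "[2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] "
        "0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe)",
    ]
    for (kind, peer), number in sorted(count_disconnects(sample).items()):
        print(kind, peer, number)
```

For a real log, the same function can be fed an open file object, for example `count_disconnects(open(path))`, where `path` is whichever brick or client logfile is being inspected.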


