thr3ads.net - Gluster users - [Gluster-users] RE : Frequent connect and disconnect messages flooded in logs [Dec 2016]


Mohammed Rafi K C

2016-Dec-06 07:15 UTC

[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs

On 12/03/2016 12:56 AM, Micha Ober wrote:
> ** Update: ** I have downgraded from 3.8.6 to 3.7.17 now, but the
> problem still exists.
>
> Client log: http://paste.ubuntu.com/23569065/
> Brick log: http://paste.ubuntu.com/23569067/
>
> Please note that each server has two bricks.
> According to the logs, however, one brick loses the connection to all
> other hosts:
> [2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe)
> [2016-12-02 18:38:53.703381] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.62:49118 failed (Broken pipe)
> [2016-12-02 18:38:53.703380] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.107:49121 failed (Broken pipe)
> [2016-12-02 18:38:53.703424] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.206:49120 failed (Broken pipe)
> [2016-12-02 18:38:53.703359] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.58:49121 failed (Broken pipe)
>
> The SECOND brick on the SAME host is NOT affected, i.e. no disconnects!
> As I said, the network connection is fine and the disks are idle.
> The CPU always has 2 free cores.
>
> It looks like I have to downgrade to 3.4 now in order for the disconnects
> to stop.

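To see at a glance which peers a brick is losing, the "Broken pipe" warnings quoted above can be tallied per peer address. The following is only a minimal sketch: the regular expression is written against the line format shown in this thread, and the function name `count_broken_pipes` and the `SAMPLE_LINES` list are illustrative, not part of any Gluster tooling. Other Gluster versions may format these lines differently.

```python
import re
from collections import Counter

# Pattern for the brick-log warnings quoted in this thread (format assumed
# from those lines only; other versions may differ).
BROKEN_PIPE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] W \[socket\.c:\d+:__socket_rwv\] "
    r"(?P<xlator>\S+): writev on (?P<peer>\S+):(?P<port>\d+) "
    r"failed \(Broken pipe\)"
)


def count_broken_pipes(lines):
    """Return a Counter mapping peer address -> number of broken-pipe warnings."""
    counts = Counter()
    for line in lines:
        match = BROKEN_PIPE.match(line.strip())
        if match:
            counts[match.group("peer")] += 1
    return counts


# The five warnings from the brick log above, plus one unrelated line
# (the ping-timeout message from the client side) that must be ignored.
SAMPLE_LINES = [
    "[2016-12-02 18:38:53.703301] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.219:49121 failed (Broken pipe)",
    "[2016-12-02 18:38:53.703381] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.62:49118 failed (Broken pipe)",
    "[2016-12-02 18:38:53.703380] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.107:49121 failed (Broken pipe)",
    "[2016-12-02 18:38:53.703424] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.206:49120 failed (Broken pipe)",
    "[2016-12-02 18:38:53.703359] W [socket.c:596:__socket_rwv] 0-tcp.gv0-server: writev on X.X.X.58:49121 failed (Broken pipe)",
    "[2016-12-06 15:12:29.006325] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: server X.X.18.62:49153 has not responded in the last 42 seconds, disconnecting.",
]

if __name__ == "__main__":
    for peer, hits in count_broken_pipes(SAMPLE_LINES).most_common():
        print(peer, hits)
```

Running the same function over the log of each brick on a host would show whether only one brick's log contains these warnings, as described above.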
Hi Micha,

Thanks for the update, and sorry about the trouble with the newer Gluster
versions. I understand the need to downgrade, as this is a production
setup.

Can you tell me which client type is used here: FUSE, NFS, NFS-Ganesha,
SMB or libgfapi?

I have not been able to reproduce the issue (I have been trying for the
last three days), and the logs are not much help here because there is
little logging in the socket layer. Could you please create a dummy
cluster and try to reproduce the issue there? If that works, we can
experiment with that volume, and I could provide a debug build to use
for further debugging.

If you don't have the bandwidth for this, please feel free to skip it ;).

Regards
Rafi KC
>
> - Micha
>
> On 30.11.2016 at 06:57, Mohammed Rafi K C wrote:
>>
>> Hi Micha,
>>
>> I have changed the thread and subject so that your original thread
>> stays focused on your original query. Let's try to fix the problem
>> you observed with 3.8.4; I have started this new thread to discuss
>> the frequent disconnect problem.
>>
>> *If anyone else has experienced the same problem, please respond to
>> this mail.*
>>
>> It would be very helpful if you could give us some more logs from
>> clients and bricks. Any steps to reproduce will also help us chase
>> the problem further.
>>
>> Regards
>>
>> Rafi KC
>>
>> On 11/30/2016 04:44 AM, Micha Ober wrote:
>>> I had opened another thread on this mailing list (Subject: "After
>>> upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in
>>> disconnects and split-brain").
>>>
>>> The title may be a bit misleading now, as I am no longer observing
>>> high CPU usage after upgrading to 3.8.6, but the disconnects are
>>> still happening and the number of files in split-brain is growing.
>>>
>>> Setup: 6 compute nodes, each serving as a glusterfs server and
>>> client, Ubuntu 14.04, two bricks per node, distribute-replicate
>>>
>>> I have two gluster volumes set up (one for scratch data, one for the
>>> slurm scheduler). Only the scratch data volume shows critical errors
>>> "[...] has not responded in the last 42 seconds, disconnecting.". So
>>> I can rule out network problems; the gigabit link between the nodes
>>> is not saturated at all. The disks are almost idle (<10%).
>>>
>>> I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster,
>>> running fine since it was deployed.
>>> I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine
>>> for almost a year.
>>>
>>> After upgrading to 3.8.5, the problems (as described) started. I
>>> would like to use some of the new features of the newer versions
>>> (like bitrot), but the users can't run their compute jobs right now
>>> because the result files are garbled.
>>>
>>> There also seems to be a bug report with a similar problem (but no
>>> progress):
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1370683
>>>
>>> For me, ALL servers are affected (not isolated to one or two servers).
>>>
>>> I also see messages like "INFO: task gpu_graphene_bv:4476 blocked
>>> for more than 120 seconds." in the syslog.
>>>
>>> For completeness (gv0 is the scratch volume, gv2 the slurm volume):
>>>
>>> [root@giant2: ~]# gluster v info
>>>
>>> Volume Name: gv0
>>> Type: Distributed-Replicate
>>> Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 6 x 2 = 12
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: giant1:/gluster/sdc/gv0
>>> Brick2: giant2:/gluster/sdc/gv0
>>> Brick3: giant3:/gluster/sdc/gv0
>>> Brick4: giant4:/gluster/sdc/gv0
>>> Brick5: giant5:/gluster/sdc/gv0
>>> Brick6: giant6:/gluster/sdc/gv0
>>> Brick7: giant1:/gluster/sdd/gv0
>>> Brick8: giant2:/gluster/sdd/gv0
>>> Brick9: giant3:/gluster/sdd/gv0
>>> Brick10: giant4:/gluster/sdd/gv0
>>> Brick11: giant5:/gluster/sdd/gv0
>>> Brick12: giant6:/gluster/sdd/gv0
>>> Options Reconfigured:
>>> auth.allow: X.X.X.*,127.0.0.1
>>> nfs.disable: on
>>>
>>> Volume Name: gv2
>>> Type: Replicate
>>> Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: giant1:/gluster/sdd/gv2
>>> Brick2: giant2:/gluster/sdd/gv2
>>> Options Reconfigured:
>>> auth.allow: X.X.X.*,127.0.0.1
>>> cluster.granular-entry-heal: on
>>> cluster.locking-scheme: granular
>>> nfs.disable: on
>>>
>>>
>>>         2016-11-29 18:53 GMT+01:00 Atin Mukherjee <amukherj at redhat.com>:
>>>
>>>             Would you be able to share what is not working for you
>>>             in 3.8.x (please mention the exact version)? 3.4 is quite
>>>             old, and falling back to an unsupported version doesn't
>>>             look like a feasible option.
>>>
>>>             On Tue, 29 Nov 2016 at 17:01, Micha Ober
>>>             <micha2k at gmail.com> wrote:
>>>
>>>                 Hi,
>>>
>>>                 I was using gluster 3.4 and upgraded to 3.8, but
>>>                 that version proved to be unusable for me. I now
>>>                 need to downgrade.
>>>
>>>                 I'm running Ubuntu 14.04. As upgrades of the op
>>>                 version are irreversible, I guess I have to delete
>>>                 all gluster volumes and re-create them with the
>>>                 downgraded version.
>>>
>>>                 0. Back up data
>>>                 1. Unmount all gluster volumes
>>>                 2. apt-get purge glusterfs-server glusterfs-client
>>>                 3. Remove PPA for 3.8
>>>                 4. Add PPA for older version
>>>                 5. apt-get install glusterfs-server glusterfs-client
>>>                 6. Create volumes
>>>
>>>                 Is "purge" enough to delete all configuration files
>>>                 of the currently installed version, or do I need to
>>>                 manually clear some residues before installing an
>>>                 older version?
>>>
>>>                 Thanks.
>>>                 _______________________________________________
>>>                 Gluster-users mailing list
>>>                 Gluster-users at gluster.org
>>>                 http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>>             -- 
>>>             - Atin (atinm)
>>>

Micha Ober

2016-Dec-06 23:32 UTC

[Gluster-users] RE : Frequent connect and disconnect messages flooded in logs

Hi,

Thank you for your answer, and even more for the question!
Until now, I was using FUSE. Today I changed all mounts to NFS, still on
the same 3.7.17 version.

However, the problem is still the same. The NFS log file now contains
lines like this:

[2016-12-06 15:12:29.006325] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: server X.X.18.62:49153 has not responded in the last 42 seconds, disconnecting.

Interestingly enough, the IP address X.X.18.62 belongs to the same
machine! As I wrote earlier, each node acts as both server and client,
since each node contributes bricks to the volume. Every server connects
to itself via its hostname. For example, the fstab on the node "giant2"
looks like this:

#giant2:/gv0    /shared_data    glusterfs       defaults,noauto 0       0
#giant2:/gv2    /shared_slurm   glusterfs       defaults,noauto 0       0

giant2:/gv0     /shared_data    nfs defaults,_netdev,vers=3 0       0
giant2:/gv2     /shared_slurm   nfs defaults,_netdev,vers=3 0       0

So I understand the disconnects even less.
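To narrow down which brick a ping-timeout message refers to, the client translator name in the log line can be matched against the brick list. The sketch below parses the line quoted above and looks up the brick. It assumes that the translator `gv0-client-N` corresponds to the (N+1)-th brick in the `gluster v info` listing, i.e. that clients are numbered from zero in brick order; that mapping is an assumption and should be checked against the generated volfile. The names `parse_ping_timeout`, `brick_for_client` and `GV0_BRICKS` are illustrative only.

```python
import re

# Pattern for the "ping timer expired" line quoted in this thread.
PING_EXPIRED = re.compile(
    r"\] C \[rpc-clnt-ping\.c:\d+:rpc_clnt_ping_timer_expired\] "
    r"\d+-(?P<volume>\S+)-client-(?P<index>\d+): "
    r"server (?P<addr>\S+):(?P<port>\d+) has not responded in the last "
    r"(?P<timeout>\d+) seconds"
)

# Bricks of gv0 in the order shown by "gluster v info" earlier in the thread.
GV0_BRICKS = [
    "giant1:/gluster/sdc/gv0", "giant2:/gluster/sdc/gv0",
    "giant3:/gluster/sdc/gv0", "giant4:/gluster/sdc/gv0",
    "giant5:/gluster/sdc/gv0", "giant6:/gluster/sdc/gv0",
    "giant1:/gluster/sdd/gv0", "giant2:/gluster/sdd/gv0",
    "giant3:/gluster/sdd/gv0", "giant4:/gluster/sdd/gv0",
    "giant5:/gluster/sdd/gv0", "giant6:/gluster/sdd/gv0",
]


def parse_ping_timeout(line):
    """Return the fields of a ping-timeout log line as a dict, or None."""
    match = PING_EXPIRED.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("index", "port", "timeout"):
        fields[key] = int(fields[key])
    return fields


def brick_for_client(index, bricks):
    """Map a zero-based client index to a brick (assumed numbering, see above)."""
    return bricks[index]


LOG_LINE = (
    "[2016-12-06 15:12:29.006325] C "
    "[rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-gv0-client-7: server "
    "X.X.18.62:49153 has not responded in the last 42 seconds, disconnecting."
)

if __name__ == "__main__":
    event = parse_ping_timeout(LOG_LINE)
    print(event)
    print(brick_for_client(event["index"], GV0_BRICKS))
```

If the numbering assumption holds, `gv0-client-7` would be the brick on /gluster/sdd of giant2, which would fit the observation that a node is timing out against a brick on itself.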

I don't know if it's possible to create a dummy cluster that shows
the same behaviour, because the disconnects only happen when compute
jobs are running on those nodes. They are GPU compute jobs, which
cannot easily be emulated in a VM.

As we have more clusters (which are running fine with an ancient 3.4
version :-)) and we do not currently depend on this particular cluster
(which should remain the case for this month, I think), I should be
able to deploy the debug build on the "real" cluster, if you can
provide one.

Regards and thanks,
Micha



