Mohammed Rafi K C
2015-Jul-22 05:33 UTC
[Gluster-users] Change transport-type on volume from tcp to rdma, tcp
On 07/22/2015 04:51 AM, Geoffrey Letessier wrote:
> Hi Niels,
>
> Thanks for replying.
>
> In fact, after having checked the logs, I discovered that GlusterFS tried
> to connect to a brick on a TCP (or RDMA) port allocated to another
> volume… (a bug?)
> For example, here is an extract of my workdir.log file:
> [2015-07-21 21:34:01.820188] E [socket.c:2332:socket_connect_finish] 0-vol_workdir_amd-client-0: connection to 10.0.4.1:49161 failed (Connexion refusée)
> [2015-07-21 21:34:01.822563] E [socket.c:2332:socket_connect_finish] 0-vol_workdir_amd-client-2: connection to 10.0.4.1:49162 failed (Connexion refusée)
>
> But those 2 ports (49161 and 49162) belong to my vol_home volume, not to
> vol_workdir_amd.
>
> Now, after having restarted all glusterd daemons synchronously (pdsh -w
> cl-storage[1-4] service glusterd restart), everything seems to be back to
> a normal situation (size, write permission, etc.)
>
> But, a few minutes later, I noticed a strange thing I have been seeing
> since I upgraded my storage cluster from 3.5.3 to 3.7.2-3: when I try to
> mount some volumes (particularly my vol_shared volume, a replicated
> volume), my system can hang… And, because I use it in my bashrc file for
> my environment modules, I then need to restart my node. The same happens
> if I run df on the mounted volume (when it doesn't hang during the mount).
>
> With the TCP transport-type, the situation seems to be more stable.
>
> In addition: if I restart a storage node, I can't use the Gluster CLI (it
> also hangs).
>
> Do you have an idea?

Are you using a bash script to start/mount the volume? If so, add a sleep
after the volume start and after the mount, to allow all the processes to
start properly, because the RDMA protocol takes some time to initialize its
resources.

Regards,
Rafi KC
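For illustration, a rough sketch of the kind of script Rafi describes. The
volume name and mount point are taken from this thread; the sleep durations
are guesses to be tuned for your hardware:

    #!/bin/bash
    # Start the volume, give the brick processes and RDMA resources time
    # to come up, then mount. The ".rdma" volume-name suffix selects the
    # RDMA transport, as in the df output further down in this thread.
    gluster volume start vol_workdir_amd
    sleep 10
    mount -t glusterfs localhost:/vol_workdir_amd.rdma /workdir
    sleep 5   # brief grace period before the mount is used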
> One more time, thanks a lot for your help,
> Geoffrey
>
> ------------------------------------------------------
> Geoffrey Letessier
> IT manager & systems engineer
> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
> Institut de Biologie Physico-Chimique
> 13, rue Pierre et Marie Curie - 75005 Paris
> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
>
> On 21 Jul 2015, at 23:49, Niels de Vos <ndevos at redhat.com> wrote:
>
>> On Tue, Jul 21, 2015 at 11:20:20PM +0200, Geoffrey Letessier wrote:
>>> Hello Soumya, hello everybody,
>>>
>>> network.ping-timeout was set to 42 seconds. I set it to 0 but it made
>>> no difference. The problem was that, after having re-set the
>>> transport-type to rdma,tcp, some bricks went down after a few minutes.
>>> Despite restarting the volumes, a few minutes later some
>>> [other/different] bricks went down again.
>>
>> I'm not sure if the ping-timeout is handled differently when RDMA is
>> used. Adding two of the guys that know RDMA well on CC.
>>
>>> Now, after re-creating my volume, the bricks stay up but, oddly, I'm
>>> not able to write to my volume. In addition, I defined a distributed
>>> volume with 2 servers and 4 bricks of 250GB each, yet my final volume
>>> appears to be sized at only 500GB… It's baffling.
>>
>> As seen further below, the 500GB volume size is caused by two
>> unreachable bricks. When bricks are not reachable, their size cannot be
>> detected by the client, and therefore 2x 250GB is missing.
>>
>> It is unclear to me why writing to a pure distributed volume fails.
>> When a brick is not reachable and a file should be created there, the
>> file would normally get created on another brick. When the brick that
>> should have the file comes online, and a new lookup for the file is
>> done, a so-called "link file" is created, which points to the file on
>> the other brick. I guess the failure has to do with the connection
>> issues, and I would suggest getting those solved first.
>>
>> HTH,
>> Niels
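As an aside, such DHT link files are normally visible on a brick as empty
files with the sticky-bit-only mode "---------T" plus a
trusted.glusterfs.dht.linkto extended attribute. A hypothetical check,
reusing a brick path from this thread with an invented file name:

    # A DHT link file shows up with mode ---------T and zero size:
    ls -l /export/brick_workdir/brick1/data/test
    # Its linkto xattr names the subvolume holding the real file:
    getfattr -n trusted.glusterfs.dht.linkto -e text \
        /export/brick_workdir/brick1/data/test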
>>
>>> Here you can find some information:
>>> # gluster volume status vol_workdir_amd
>>> Status of volume: vol_workdir_amd
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick ib-storage1:/export/brick_workdir/bri
>>> ck1/data                                    49185     49186      Y       23098
>>> Brick ib-storage3:/export/brick_workdir/bri
>>> ck1/data                                    49158     49159      Y       3886
>>> Brick ib-storage1:/export/brick_workdir/bri
>>> ck2/data                                    49187     49188      Y       23117
>>> Brick ib-storage3:/export/brick_workdir/bri
>>> ck2/data                                    49160     49161      Y       3905
>>>
>>> # gluster volume info vol_workdir_amd
>>>
>>> Volume Name: vol_workdir_amd
>>> Type: Distribute
>>> Volume ID: 087d26ea-c6df-4cbe-94af-ecd87b59aedb
>>> Status: Started
>>> Number of Bricks: 4
>>> Transport-type: tcp,rdma
>>> Bricks:
>>> Brick1: ib-storage1:/export/brick_workdir/brick1/data
>>> Brick2: ib-storage3:/export/brick_workdir/brick1/data
>>> Brick3: ib-storage1:/export/brick_workdir/brick2/data
>>> Brick4: ib-storage3:/export/brick_workdir/brick2/data
>>> Options Reconfigured:
>>> performance.readdir-ahead: on
>>>
>>> # pdsh -w storage[1,3] df -h /export/brick_workdir/brick{1,2}
>>> storage3: Filesystem            Size  Used Avail Use% Mounted on
>>> storage3: /dev/mapper/st--block1-blk1--workdir
>>> storage3:                       250G   34M  250G   1% /export/brick_workdir/brick1
>>> storage3: /dev/mapper/st--block2-blk2--workdir
>>> storage3:                       250G   34M  250G   1% /export/brick_workdir/brick2
>>> storage1: Filesystem            Size  Used Avail Use% Mounted on
>>> storage1: /dev/mapper/st--block1-blk1--workdir
>>> storage1:                       250G   33M  250G   1% /export/brick_workdir/brick1
>>> storage1: /dev/mapper/st--block2-blk2--workdir
>>> storage1:                       250G   33M  250G   1% /export/brick_workdir/brick2
>>>
>>> # df -h /workdir/
>>> Filesystem            Size  Used Avail Use% Mounted on
>>> localhost:vol_workdir_amd.rdma
>>>                       500G   67M  500G   1% /workdir
>>>
>>> # touch /workdir/test
>>> touch: impossible de faire un touch « /workdir/test »: Aucun fichier ou dossier de ce type
>>>
>>> # tail -30l /var/log/glusterfs/workdir.log
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:33.927673] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:37.877231] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
>>> [2015-07-21 21:10:37.880556] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
>>> [2015-07-21 21:10:37.914661] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:37.923535] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:41.883925] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
>>> [2015-07-21 21:10:41.887085] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
>>> [2015-07-21 21:10:41.919394] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:41.932622] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:44.682636] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.682947] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.683240] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.683472] W [dht-diskusage.c:48:dht_du_info_cbk] 0-vol_workdir_amd-dht: failed to get disk info from vol_workdir_amd-client-0
>>> [2015-07-21 21:10:44.683506] W [dht-diskusage.c:48:dht_du_info_cbk] 0-vol_workdir_amd-dht: failed to get disk info from vol_workdir_amd-client-2
>>> [2015-07-21 21:10:44.683532] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.683551] W [fuse-bridge.c:1970:fuse_create_cbk] 0-glusterfs-fuse: 18: /test => -1 (Aucun fichier ou dossier de ce type)
>>> [2015-07-21 21:10:44.683619] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.683846] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:45.886807] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
>>> [2015-07-21 21:10:45.893059] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
>>> [2015-07-21 21:10:45.920434] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:45.925292] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>> Host Unreachable, Check your connection with IPoIB
>>>
>>> I have been using GlusterFS in production for around 3 years without
>>> any blocking problem, but the situation has been dire for more than 3
>>> weeks… Indeed, our production has been down for roughly 3.5 weeks
>>> (with many different problems, first with GlusterFS v3.5.3 and now
>>> with 3.7.2-3), and I need to get it running again…
>>>
>>> Thanks in advance,
>>> Geoffrey
>>> ------------------------------------------------------
>>> Geoffrey Letessier
>>> IT manager & systems engineer
>>> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
>>> Institut de Biologie Physico-Chimique
>>> 13, rue Pierre et Marie Curie - 75005 Paris
>>> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
>>>
>>> On 21 Jul 2015, at 19:36, Soumya Koduri <skoduri at redhat.com> wrote:
>>>
>>>> From the following errors,
>>>>
>>>> [2015-07-21 14:36:30.495321] I [MSGID: 114020] [client.c:2118:notify] 0-vol_shared-client-0: parent translators are ready, attempting connect on transport
>>>> [2015-07-21 14:36:30.498989] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT 0 on socket 12, Protocole non disponible
>>>> [2015-07-21 14:36:30.499004] E [socket.c:3015:socket_connect] 0-vol_shared-client-0: Failed to set keep-alive: Protocole non disponible
>>>>
>>>> it looks like setting the TCP_USER_TIMEOUT value to 0 on the socket
>>>> failed with the error (IIUC) "Protocol not available".
>>>> Could you check whether 'network.ping-timeout' is set to zero for
>>>> that volume using 'gluster volume info'? From the code it looks like
>>>> 'TCP_USER_TIMEOUT' can take the value zero, so I am not sure why it
>>>> failed.
>>>>
>>>> Niels, any thoughts?
>>>>
>>>> Thanks,
>>>> Soumya
>>>>
>>>> On 07/21/2015 08:15 PM, Geoffrey Letessier wrote:
>>>>> [2015-07-21 14:36:30.495321] I [MSGID: 114020] [client.c:2118:notify]
>>>>> 0-vol_shared-client-0: parent translators are ready, attempting
>>>>> connect on transport
>>>>> [2015-07-21 14:36:30.498989] W [socket.c:923:__socket_keepalive]
>>>>> 0-socket: failed to set TCP_USER_TIMEOUT 0 on socket 12, Protocole
>>>>> non disponible
>>>>> [2015-07-21 14:36:30.499004] E [socket.c:3015:socket_connect]
>>>>> 0-vol_shared-client-0: Failed to set keep-alive: Protocole non
>>>>> disponible
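A quick way to act on Soumya's suggestion above: check whether
network.ping-timeout was overridden for the volume, and restore its
42-second default if it was set to 0 (volume name taken from this thread):

    gluster volume info vol_shared | grep -i ping-timeout
    # restore the default if it was set to 0:
    gluster volume set vol_shared network.ping-timeout 42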
Geoffrey Letessier
2015-Jul-22 07:17 UTC
[Gluster-users] Change transport-type on volume from tcp to rdma, tcp
Hi Rafi,

That's what I do. But I notice this kind of trouble particularly when I
mount my volumes manually. In addition, when I changed my transport-type
from tcp or rdma to tcp,rdma, I had to restart my volumes for the change to
take effect.

I wonder whether these troubles are due to the RDMA protocol… because
everything looks more stable with the TCP one.

Any other idea?

Thanks for your reply, and thanks in advance,
Geoffrey

------------------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr

On 22 Jul 2015, at 07:33, Mohammed Rafi K C <rkavunga at redhat.com> wrote:

> Are you using a bash script to start/mount the volume? If so, add a sleep
> after the volume start and after the mount, to allow all the processes to
> start properly, because the RDMA protocol takes some time to initialize
> its resources.
>
> Regards,
> Rafi KC
>
> [...]
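For reference, a sketch of the transport-type change Geoffrey describes.
Per the Gluster documentation, the config.transport option only takes
effect on a stopped volume, which matches the restart he observed; the
volume name is taken from this thread and the sleep duration is a guess:

    gluster volume stop vol_workdir_amd
    gluster volume set vol_workdir_amd config.transport tcp,rdma
    gluster volume start vol_workdir_amd
    # as Rafi notes, give RDMA a moment to initialize before remounting:
    sleep 10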