Mohammed Rafi K C
2015-Jul-22 08:45 UTC
[Gluster-users] Change transport-type on volume from tcp to rdma, tcp
On 07/22/2015 01:36 PM, Geoffrey Letessier wrote:
> Oops, I forgot to add all people in CC.
>
> Yes, I guessed.
>
> With the TCP protocol, all my volumes seem OK and I don't note, for the moment, any hang.

So if I understand correctly, everything is fine with tcp (no hang, no "transport endpoint is not connected" error), and both happen with rdma. Please correct me if that's not so.

> mount command:
> - with RDMA: mount -t glusterfs -o transport=rdma,direct-io-mode=disable,enable-ino32 ib-storage1:vol_home /mnt
> - with TCP:  mount -t glusterfs -o transport=tcp,direct-io-mode=disable,enable-ino32 ib-storage1:vol_home /mnt
>
> volume status:
> # gluster volume status all
> Status of volume: vol_home
> Gluster process                                      TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick ib-storage1:/export/brick_home/brick1/data     49159     49165      Y       6547
> Brick ib-storage2:/export/brick_home/brick1/data     49161     49173      Y       24348
> Brick ib-storage3:/export/brick_home/brick1/data     49152     49156      Y       5616
> Brick ib-storage4:/export/brick_home/brick1/data     49152     49162      Y       5424
> Brick ib-storage1:/export/brick_home/brick2/data     49160     49166      Y       6548
> Brick ib-storage2:/export/brick_home/brick2/data     49162     49174      Y       24355
> Brick ib-storage3:/export/brick_home/brick2/data     49153     49157      Y       5635
> Brick ib-storage4:/export/brick_home/brick2/data     49153     49163      Y       5443
> Self-heal Daemon on localhost                        N/A       N/A        Y       6534
> Self-heal Daemon on ib-storage3                      N/A       N/A        Y       7656
> Self-heal Daemon on ib-storage2                      N/A       N/A        Y       24519
> Self-heal Daemon on ib-storage4                      N/A       N/A        Y       7288
>
> Task Status of Volume vol_home
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: vol_shared
> Gluster process                                      TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick ib-storage1:/export/brick_shared/data          49152     49164      Y       6554
> Brick ib-storage2:/export/brick_shared/data          49152     49172      Y       24362
> Self-heal Daemon on localhost                        N/A       N/A        Y       6534
> Self-heal Daemon on ib-storage3                      N/A       N/A        Y       7656
> Self-heal Daemon on ib-storage2                      N/A       N/A        Y       24519
> Self-heal Daemon on ib-storage4                      N/A       N/A        Y       7288
>
> Task Status of Volume vol_shared
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: vol_workdir_amd
> Gluster process                                      TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick ib-storage1:/export/brick_workdir/brick1/data  49191     49192      Y       6555
> Brick ib-storage3:/export/brick_workdir/brick1/data  49164     49165      Y       6368
> Brick ib-storage1:/export/brick_workdir/brick2/data  49193     49194      Y       6576
> Brick ib-storage3:/export/brick_workdir/brick2/data  49166     49167      Y       6387
>
> Task Status of Volume vol_workdir_amd
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: vol_workdir_intel
> Gluster process                                      TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick ib-storage2:/export/brick_workdir/brick1/data  49175     49176      Y       24371
> Brick ib-storage2:/export/brick_workdir/brick2/data  49177     49178      Y       24372
> Brick ib-storage4:/export/brick_workdir/brick1/data  49164     49165      Y       5571
> Brick ib-storage4:/export/brick_workdir/brick2/data  49166     49167      Y       5590
>
> Task Status of Volume vol_workdir_intel
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Concerning the brick logs, do you want to have all bricks on every server?

Any errors from the client log and brick logs, and any log entries with message IDs between 102000 and 104000 from the same.

Rafi KC

> Geoffrey
>
> ------------------------------------------------------
> Geoffrey Letessier
> Responsable informatique & ingénieur système
> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
> Institut de Biologie Physico-Chimique
> 13, rue Pierre et Marie Curie - 75005 Paris
> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
>
> On 22 July 2015, at 10:00, Mohammed Rafi K C <rkavunga at redhat.com> wrote:
>
>> On 07/22/2015 12:55 PM, Geoffrey Letessier wrote:
>>> Concerning the hang, I only saw this once with the TCP protocol; RDMA actually seems to be the cause.
>>
>> If you are mounting a tcp,rdma volume using the tcp protocol, all the communication will go through the tcp connection and rdma won't come in between client and server.
>>
>>> … And, after a moment (a few minutes after having restarted my transfer back of around 40 TB), my volume falls down (and all my rsyncs too):
>>> [root at atlas ~]# df -h /mnt
>>> df: « /mnt »: Noeud final de transport n'est pas connecté
>>> df: aucun système de fichiers traité
>>> aka "transport endpoint is not connected"
>>
>> Can you send me the following details, if possible?
>> 1) mount command used, 2) volume status, 3) client and brick logs
>>
>> Regards
>> Rafi KC
>>
>>> Geoffrey
>>>
>>> On 22 July 2015, at 09:17, Geoffrey Letessier <geoffrey.letessier at cnrs.fr> wrote:
>>>
>>>> Hi Rafi,
>>>>
>>>> That's what I do. But I notice this kind of trouble particularly when I mount my volumes manually.
>>>>
>>>> In addition, when I changed my transport-type from tcp or rdma to tcp,rdma, I had to restart my volumes for the change to take effect.
>>>>
>>>> I wonder if these troubles are due to the RDMA protocol… because everything looks more stable with the TCP one.
>>>>
>>>> Another idea?
>>>> Thanks for replying, and thanks in advance,
>>>> Geoffrey
>>>>
>>>> On 22 July 2015, at 07:33, Mohammed Rafi K C <rkavunga at redhat.com> wrote:
>>>>
>>>>> On 07/22/2015 04:51 AM, Geoffrey Letessier wrote:
>>>>>> Hi Niels,
>>>>>>
>>>>>> Thanks for replying.
>>>>>>
>>>>>> In fact, after having checked the logs, I've discovered GlusterFS tried to connect to a brick on a TCP (or RDMA) port allocated to another volume… (bug?)
>>>>>> For example, here is an extract of my workdir.log file:
>>>>>> [2015-07-21 21:34:01.820188] E [socket.c:2332:socket_connect_finish] 0-vol_workdir_amd-client-0: connection to 10.0.4.1:49161 failed (Connexion refusée)
>>>>>> [2015-07-21 21:34:01.822563] E [socket.c:2332:socket_connect_finish] 0-vol_workdir_amd-client-2: connection to 10.0.4.1:49162 failed (Connexion refusée)
>>>>>>
>>>>>> But those 2 ports (49161 and 49162) concern only my vol_home volume, not the vol_workdir_amd one.
>>>>>>
>>>>>> Now, after having restarted all glusterd daemons synchronously (pdsh -w cl-storage[1-4] service glusterd restart), everything seems to be back to a normal situation (size, write permission, etc.).
>>>>>>
>>>>>> But, a few minutes later, I noted a strange thing I have been seeing since I upgraded my storage cluster from 3.5.3 to 3.7.2-3: when I try to mount some volumes (particularly my vol_shared volume, a replicated volume), my system can hang… And, because I use it in my bashrc file for my environment modules, I need to restart my node. The same happens if I do a df on my mounted volume (if it doesn't hang during the mount).
>>>>>>
>>>>>> With the TCP transport-type, the situation seems to be more stable.
>>>>>>
>>>>>> In addition: if I restart a storage node, I can't use the Gluster CLI (it also hangs).
>>>>>>
>>>>>> Do you have an idea?
>>>>>
>>>>> Are you using a bash script to start/mount the volume? If so, add a sleep between volume start and mount, to allow all the processes to start properly, because the RDMA protocol takes some time to initialise its resources.
>>>>>
>>>>> Regards
>>>>> Rafi KC
>>>>>
>>>>>> One more time, thanks a lot for your help,
>>>>>> Geoffrey
>>>>>>
>>>>>> On 21 July 2015, at 23:49, Niels de Vos <ndevos at redhat.com> wrote:
>>>>>>
>>>>>>> On Tue, Jul 21, 2015 at 11:20:20PM +0200, Geoffrey Letessier wrote:
>>>>>>>> Hello Soumya, hello everybody,
>>>>>>>>
>>>>>>>> network.ping-timeout was set to 42 seconds. I set it to 0 but saw no
>>>>>>>> difference. The problem was that, after having re-set the transport-type to
>>>>>>>> rdma,tcp, some bricks went down after a few minutes. Despite restarting the
>>>>>>>> volumes, after a few minutes some [other/different] bricks went down again.
>>>>>>>
>>>>>>> I'm not sure if the ping-timeout is handled differently when RDMA is
>>>>>>> used. Adding two of the guys that know RDMA well on CC.
>>>>>>>
>>>>>>>> Now, after re-creation of my volume, the bricks stay alive but, oddly, I'm
>>>>>>>> not able to write on my volume. In addition, I defined a distributed
>>>>>>>> volume with 2 servers and 4 bricks of 250GB each, and my final volume
>>>>>>>> seems to be sized at only 500GB… It's amazing…
>>>>>>>
>>>>>>> As seen further below, the 500GB volume is caused by two unreachable
>>>>>>> bricks. When the bricks are not reachable, the size of the bricks can
>>>>>>> not be detected by the client and therefore 2x 250 GB is missing.
>>>>>>>
>>>>>>> It is unclear to me why writing to a pure distributed volume fails. When
>>>>>>> a brick is not reachable, and the file should be created there, it
>>>>>>> would normally get created on another brick. When the brick that should
>>>>>>> have the file comes online, and a new lookup for the file is done, a so-
>>>>>>> called "link file" is created, which points to the file on the other
>>>>>>> brick. I guess the failure has to do with the connection issues, and I
>>>>>>> would suggest getting that solved first.
>>>>>>>
>>>>>>> HTH,
>>>>>>> Niels
>>>>>>>
>>>>>>>> Here you can find some information:
>>>>>>>> # gluster volume status vol_workdir_amd
>>>>>>>> Status of volume: vol_workdir_amd
>>>>>>>> Gluster process                                      TCP Port  RDMA Port  Online  Pid
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> Brick ib-storage1:/export/brick_workdir/brick1/data  49185     49186      Y       23098
>>>>>>>> Brick ib-storage3:/export/brick_workdir/brick1/data  49158     49159      Y       3886
>>>>>>>> Brick ib-storage1:/export/brick_workdir/brick2/data  49187     49188      Y       23117
>>>>>>>> Brick ib-storage3:/export/brick_workdir/brick2/data  49160     49161      Y       3905
>>>>>>>>
>>>>>>>> # gluster volume info vol_workdir_amd
>>>>>>>>
>>>>>>>> Volume Name: vol_workdir_amd
>>>>>>>> Type: Distribute
>>>>>>>> Volume ID: 087d26ea-c6df-4cbe-94af-ecd87b59aedb
>>>>>>>> Status: Started
>>>>>>>> Number of Bricks: 4
>>>>>>>> Transport-type: tcp,rdma
>>>>>>>> Bricks:
>>>>>>>> Brick1: ib-storage1:/export/brick_workdir/brick1/data
>>>>>>>> Brick2: ib-storage3:/export/brick_workdir/brick1/data
>>>>>>>> Brick3: ib-storage1:/export/brick_workdir/brick2/data
>>>>>>>> Brick4: ib-storage3:/export/brick_workdir/brick2/data
>>>>>>>> Options Reconfigured:
>>>>>>>> performance.readdir-ahead: on
>>>>>>>>
>>>>>>>> # pdsh -w storage[1,3] df -h /export/brick_workdir/brick{1,2}
>>>>>>>> storage3: Filesystem                            Size  Used Avail Use% Mounted on
>>>>>>>> storage3: /dev/mapper/st--block1-blk1--workdir  250G   34M  250G   1% /export/brick_workdir/brick1
>>>>>>>> storage3: /dev/mapper/st--block2-blk2--workdir  250G   34M  250G   1% /export/brick_workdir/brick2
>>>>>>>> storage1: Filesystem                            Size  Used Avail Use% Mounted on
>>>>>>>> storage1: /dev/mapper/st--block1-blk1--workdir  250G   33M  250G   1% /export/brick_workdir/brick1
>>>>>>>> storage1: /dev/mapper/st--block2-blk2--workdir  250G   33M  250G   1% /export/brick_workdir/brick2
>>>>>>>>
>>>>>>>> # df -h /workdir/
>>>>>>>> Filesystem                      Size  Used Avail Use% Mounted on
>>>>>>>> localhost:vol_workdir_amd.rdma  500G   67M  500G   1% /workdir
>>>>>>>>
>>>>>>>> # touch /workdir/test
>>>>>>>> touch: impossible de faire un touch « /workdir/test »: Aucun fichier ou dossier de ce type
>>>>>>>>
>>>>>>>> # tail -30l /var/log/glusterfs/workdir.log
>>>>>>>> Host Unreachable, Check your connection with IPoIB
>>>>>>>> [2015-07-21 21:10:33.927673] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>>>>>>> Host Unreachable, Check your connection with IPoIB
>>>>>>>> [2015-07-21 21:10:37.877231] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
>>>>>>>> [2015-07-21 21:10:37.880556] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
>>>>>>>> [2015-07-21 21:10:37.914661] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
>>>>>>>> Host Unreachable, Check your connection with IPoIB
>>>>>>>> [2015-07-21 21:10:37.923535] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>>>>>>> Host Unreachable, Check your connection with IPoIB
>>>>>>>> [2015-07-21 21:10:41.883925] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
>>>>>>>> [2015-07-21 21:10:41.887085] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
>>>>>>>> [2015-07-21 21:10:41.919394] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
>>>>>>>> Host Unreachable, Check your connection with IPoIB
>>>>>>>> [2015-07-21 21:10:41.932622] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>>>>>>> Host Unreachable, Check your connection with IPoIB
>>>>>>>> [2015-07-21 21:10:44.682636] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>>>>>>> [2015-07-21 21:10:44.682947] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>>>>>>> [2015-07-21 21:10:44.683240] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>>>>>>> [2015-07-21 21:10:44.683472] W [dht-diskusage.c:48:dht_du_info_cbk] 0-vol_workdir_amd-dht: failed to get disk info from vol_workdir_amd-client-0
>>>>>>>> [2015-07-21 21:10:44.683506] W [dht-diskusage.c:48:dht_du_info_cbk] 0-vol_workdir_amd-dht: failed to get disk info from vol_workdir_amd-client-2
>>>>>>>> [2015-07-21 21:10:44.683532] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>>>>>>> [2015-07-21 21:10:44.683551] W [fuse-bridge.c:1970:fuse_create_cbk] 0-glusterfs-fuse: 18: /test => -1 (Aucun fichier ou dossier de ce type)
>>>>>>>> [2015-07-21 21:10:44.683619] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>>>>>>> [2015-07-21 21:10:44.683846] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>>>>>>> [2015-07-21 21:10:45.886807] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
>>>>>>>> [2015-07-21 21:10:45.893059] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
>>>>>>>> [2015-07-21 21:10:45.920434] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
>>>>>>>> Host Unreachable, Check your connection with IPoIB
>>>>>>>> [2015-07-21 21:10:45.925292] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>>>>>>> Host Unreachable, Check your connection with IPoIB
>>>>>>>>
>>>>>>>> I have used GlusterFS in production for around 3 years without any blocking
>>>>>>>> problem, but the situation has been dire for more than 3 weeks now…
>>>>>>>> Indeed, our production has been down for roughly 3.5 weeks (with many
>>>>>>>> different problems with GlusterFS v3.5.3 and now with 3.7.2-3) and
>>>>>>>> I need to restart it…
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Geoffrey
>>>>>>>>
>>>>>>>> On 21 July 2015, at 19:36, Soumya Koduri <skoduri at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> From the following errors,
>>>>>>>>>
>>>>>>>>> [2015-07-21 14:36:30.495321] I [MSGID: 114020] [client.c:2118:notify] 0-vol_shared-client-0: parent translators are ready, attempting connect on transport
>>>>>>>>> [2015-07-21 14:36:30.498989] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT 0 on socket 12, Protocole non disponible
>>>>>>>>> [2015-07-21 14:36:30.499004] E [socket.c:3015:socket_connect] 0-vol_shared-client-0: Failed to set keep-alive: Protocole non disponible
>>>>>>>>>
>>>>>>>>> it looks like setting the TCP_USER_TIMEOUT value to 0 on the socket failed with the error (IIUC) "Protocol not available".
>>>>>>>>> Could you check if 'network.ping-timeout' is set to zero for that volume using 'gluster volume info'? Anyway, from the code it looks like TCP_USER_TIMEOUT can take the value zero. Not sure why it has failed.
>>>>>>>>>
>>>>>>>>> Niels, any thoughts?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Soumya
>>>>>>>>>
>>>>>>>>> On 07/21/2015 08:15 PM, Geoffrey Letessier wrote:
>>>>>>>>>> [2015-07-21 14:36:30.495321] I [MSGID: 114020] [client.c:2118:notify] 0-vol_shared-client-0: parent translators are ready, attempting connect on transport
>>>>>>>>>> [2015-07-21 14:36:30.498989] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT 0 on socket 12, Protocole non disponible
>>>>>>>>>> [2015-07-21 14:36:30.499004] E [socket.c:3015:socket_connect] 0-vol_shared-client-0: Failed to set keep-alive: Protocole non disponible
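A minimal sketch of one way to collect what Rafi asks for above — errors from the client and brick logs, plus entries whose message ID falls between 102000 and 104000. It assumes the client mount log name seen in this thread (/var/log/glusterfs/workdir.log) and the usual default location for brick logs (/var/log/glusterfs/bricks/); adjust the paths to your own setup.

    # Client side: the FUSE mount log is named after the mount point,
    # e.g. /workdir -> /var/log/glusterfs/workdir.log
    CLIENT_LOG=/var/log/glusterfs/workdir.log

    # Warning- and error-level entries from the client log
    grep -E '\] (E|W) \[' "$CLIENT_LOG" > /tmp/client-errors.txt

    # Entries with a message ID between 102000 and 104000
    grep -E 'MSGID: (10[23][0-9]{3}|104000)' "$CLIENT_LOG" > /tmp/client-msgid.txt

    # Server side: repeat for every brick log on every storage node
    for log in /var/log/glusterfs/bricks/*.log; do
        echo "== $log =="
        grep -E '\] (E|W) \[|MSGID: (10[23][0-9]{3}|104000)' "$log"
    done > /tmp/brick-errors.txt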
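Likewise, a small sketch for Soumya's question about network.ping-timeout: check whether it is still set to zero on vol_shared and, if so, put it back to the 42-second value Geoffrey mentions (setting it to 0 made no difference in his tests, and Soumya suspects it is related to the TCP_USER_TIMEOUT warning above). Note that 'gluster volume info' only lists options that were reconfigured, so an empty grep result means the default is in effect.

    # Show reconfigured options; ping-timeout appears only if it was changed
    gluster volume info vol_shared | grep -i ping-timeout

    # Restore the previous 42-second value
    gluster volume set vol_shared network.ping-timeout 42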
Geoffrey Letessier
2015-Jul-22 14:59 UTC
[Gluster-users] Change transport-type on volume from tcp to rdma, tcp
I can confirm your words… Everything looks OK with the TCP protocol and is more-or-less unstable with the RDMA one. But TCP is slower than the RDMA protocol…

In the attachments you can find my volume mount log, all brick logs and some information concerning my vol_shared volume.

Thanks in advance,
Geoffrey

PS: sorry for the latency of my next answers, but I will be on vacation (from this evening) very far from any internet access.

------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vol_shared.tgz
Type: application/octet-stream
Size: 13800 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20150722/fa1d29e6/attachment.obj>
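A recurring symptom in this thread is the client trying to reach brick ports that Geoffrey says belong to a different volume (the vol_workdir_amd client connecting to 10.0.4.1:49161 and 49162, which he reports are vol_home ports). A rough sketch of how to spot that kind of mismatch, using the volume and log names from the thread, is to compare the ports the client is attempting against what the bricks currently advertise:

    VOL=vol_workdir_amd
    CLIENT_LOG=/var/log/glusterfs/workdir.log

    # Ports the bricks are actually listening on right now (TCP and RDMA columns)
    gluster volume status "$VOL"

    # Ports the client was recently told to use, and the ones it failed to reach
    grep -E 'changing port to|connection to .* failed' "$CLIENT_LOG" | tail -n 20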
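Finally, since the subject of the thread is changing a volume's transport from tcp to tcp,rdma, here is a rough sketch of that sequence (stop, set config.transport, start), combined with Rafi's suggestion to pause between starting the volume and mounting it so the RDMA listeners have time to initialise. The option name config.transport and the 10-second pause are assumptions to verify against your GlusterFS version, and all clients should be unmounted before stopping the volume.

    VOL=vol_home

    # Change the transport of an existing volume; it must be stopped first
    # ('volume stop' asks for confirmation)
    gluster volume stop "$VOL"
    gluster volume set "$VOL" config.transport tcp,rdma
    gluster volume start "$VOL"

    # Give the brick processes and RDMA listeners time to come up before mounting
    sleep 10

    mount -t glusterfs \
          -o transport=rdma,direct-io-mode=disable,enable-ino32 \
          ib-storage1:"$VOL" /mnt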