Mohammed Rafi K C
2015-Jul-22 05:33 UTC
[Gluster-users] Change transport-type on volume from tcp to rdma, tcp
On 07/22/2015 04:51 AM, Geoffrey Letessier wrote:
> Hi Niels,
>
> Thanks for replying.
>
> In fact, after having checked the logs, I discovered that GlusterFS tried
> to connect to a brick on a TCP (or RDMA) port allocated to another
> volume… (a bug?)
> For example, here is an extract of my workdir.log file:
> [2015-07-21 21:34:01.820188] E [socket.c:2332:socket_connect_finish] 0-vol_workdir_amd-client-0: connection to 10.0.4.1:49161 failed (Connexion refusée)
> [2015-07-21 21:34:01.822563] E [socket.c:2332:socket_connect_finish] 0-vol_workdir_amd-client-2: connection to 10.0.4.1:49162 failed (Connexion refusée)
>
> But those 2 ports (49161 and 49162) belong to my vol_home volume, not to
> vol_workdir_amd.
>
> Now, after having restarted all glusterd daemons synchronously (pdsh -w
> cl-storage[1-4] service glusterd restart), everything seems to be back to
> a normal situation (size, write permission, etc.)
>
> But, a few minutes later, I noticed a strange thing I have been seeing
> since I upgraded my storage cluster from 3.5.3 to 3.7.2-3: when I try to
> mount some volumes (particularly my vol_shared volume, a replicated
> volume), my system can hang… And, because I use it in my bashrc file for
> my environment modules, I then need to restart my node. The same happens
> if I run df on the mounted volume (when it doesn't hang during the mount).
>
> With the TCP transport-type, the situation seems to be more stable.
>
> In addition: if I restart a storage node, I can't use the Gluster CLI (it
> also hangs).
>
> Do you have an idea?

Are you using a bash script to start/mount the volume? If so, add a sleep
after the volume start and after the mount, to allow all the processes to
start properly, because the RDMA protocol takes some time to initialize its
resources.

Regards,
Rafi KC
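For illustration, a rough sketch of the kind of script Rafi describes. The
volume name and mount point are taken from this thread; the sleep durations
are guesses to be tuned for your hardware:

    #!/bin/bash
    # Start the volume, give the brick processes and RDMA resources time
    # to come up, then mount. The ".rdma" volume-name suffix selects the
    # RDMA transport, as in the df output further down in this thread.
    gluster volume start vol_workdir_amd
    sleep 10
    mount -t glusterfs localhost:/vol_workdir_amd.rdma /workdir
    sleep 5   # brief grace period before the mount is used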
> One more time, thanks a lot for your help,
> Geoffrey
>
> ------------------------------------------------------
> Geoffrey Letessier
> IT manager & systems engineer
> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
> Institut de Biologie Physico-Chimique
> 13, rue Pierre et Marie Curie - 75005 Paris
> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
>
> On 21 Jul 2015, at 23:49, Niels de Vos <ndevos at redhat.com> wrote:
>
>> On Tue, Jul 21, 2015 at 11:20:20PM +0200, Geoffrey Letessier wrote:
>>> Hello Soumya, hello everybody,
>>>
>>> network.ping-timeout was set to 42 seconds. I set it to 0 but it made
>>> no difference. The problem was that, after having re-set the
>>> transport-type to rdma,tcp, some bricks went down after a few minutes.
>>> Despite restarting the volumes, a few minutes later some
>>> [other/different] bricks went down again.
>>
>> I'm not sure if the ping-timeout is handled differently when RDMA is
>> used. Adding two of the guys that know RDMA well on CC.
>>
>>> Now, after re-creating my volume, the bricks stay up but, oddly, I'm
>>> not able to write to my volume. In addition, I defined a distributed
>>> volume with 2 servers and 4 bricks of 250GB each, yet my final volume
>>> appears to be sized at only 500GB… It's baffling.
>>
>> As seen further below, the 500GB volume size is caused by two
>> unreachable bricks. When bricks are not reachable, their size cannot be
>> detected by the client, and therefore 2x 250GB is missing.
>>
>> It is unclear to me why writing to a pure distributed volume fails.
>> When a brick is not reachable and a file should be created there, the
>> file would normally get created on another brick. When the brick that
>> should have the file comes online, and a new lookup for the file is
>> done, a so-called "link file" is created, which points to the file on
>> the other brick. I guess the failure has to do with the connection
>> issues, and I would suggest getting those solved first.
>>
>> HTH,
>> Niels
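As an aside, such DHT link files are normally visible on a brick as empty
files with the sticky-bit-only mode "---------T" plus a
trusted.glusterfs.dht.linkto extended attribute. A hypothetical check,
reusing a brick path from this thread with an invented file name:

    # A DHT link file shows up with mode ---------T and zero size:
    ls -l /export/brick_workdir/brick1/data/test
    # Its linkto xattr names the subvolume holding the real file:
    getfattr -n trusted.glusterfs.dht.linkto -e text \
        /export/brick_workdir/brick1/data/test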
>>
>>> Here you can find some information:
>>> # gluster volume status vol_workdir_amd
>>> Status of volume: vol_workdir_amd
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick ib-storage1:/export/brick_workdir/bri
>>> ck1/data                                    49185     49186      Y       23098
>>> Brick ib-storage3:/export/brick_workdir/bri
>>> ck1/data                                    49158     49159      Y       3886
>>> Brick ib-storage1:/export/brick_workdir/bri
>>> ck2/data                                    49187     49188      Y       23117
>>> Brick ib-storage3:/export/brick_workdir/bri
>>> ck2/data                                    49160     49161      Y       3905
>>>
>>> # gluster volume info vol_workdir_amd
>>>
>>> Volume Name: vol_workdir_amd
>>> Type: Distribute
>>> Volume ID: 087d26ea-c6df-4cbe-94af-ecd87b59aedb
>>> Status: Started
>>> Number of Bricks: 4
>>> Transport-type: tcp,rdma
>>> Bricks:
>>> Brick1: ib-storage1:/export/brick_workdir/brick1/data
>>> Brick2: ib-storage3:/export/brick_workdir/brick1/data
>>> Brick3: ib-storage1:/export/brick_workdir/brick2/data
>>> Brick4: ib-storage3:/export/brick_workdir/brick2/data
>>> Options Reconfigured:
>>> performance.readdir-ahead: on
>>>
>>> # pdsh -w storage[1,3] df -h /export/brick_workdir/brick{1,2}
>>> storage3: Filesystem            Size  Used Avail Use% Mounted on
>>> storage3: /dev/mapper/st--block1-blk1--workdir
>>> storage3:                       250G   34M  250G   1% /export/brick_workdir/brick1
>>> storage3: /dev/mapper/st--block2-blk2--workdir
>>> storage3:                       250G   34M  250G   1% /export/brick_workdir/brick2
>>> storage1: Filesystem            Size  Used Avail Use% Mounted on
>>> storage1: /dev/mapper/st--block1-blk1--workdir
>>> storage1:                       250G   33M  250G   1% /export/brick_workdir/brick1
>>> storage1: /dev/mapper/st--block2-blk2--workdir
>>> storage1:                       250G   33M  250G   1% /export/brick_workdir/brick2
>>>
>>> # df -h /workdir/
>>> Filesystem            Size  Used Avail Use% Mounted on
>>> localhost:vol_workdir_amd.rdma
>>>                       500G   67M  500G   1% /workdir
>>>
>>> # touch /workdir/test
>>> touch: impossible de faire un touch « /workdir/test »: Aucun fichier ou dossier de ce type
>>>
>>> # tail -30l /var/log/glusterfs/workdir.log
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:33.927673] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:37.877231] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
>>> [2015-07-21 21:10:37.880556] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
>>> [2015-07-21 21:10:37.914661] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:37.923535] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:41.883925] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
>>> [2015-07-21 21:10:41.887085] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
>>> [2015-07-21 21:10:41.919394] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:41.932622] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:44.682636] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.682947] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.683240] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.683472] W [dht-diskusage.c:48:dht_du_info_cbk] 0-vol_workdir_amd-dht: failed to get disk info from vol_workdir_amd-client-0
>>> [2015-07-21 21:10:44.683506] W [dht-diskusage.c:48:dht_du_info_cbk] 0-vol_workdir_amd-dht: failed to get disk info from vol_workdir_amd-client-2
>>> [2015-07-21 21:10:44.683532] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.683551] W [fuse-bridge.c:1970:fuse_create_cbk] 0-glusterfs-fuse: 18: /test => -1 (Aucun fichier ou dossier de ce type)
>>> [2015-07-21 21:10:44.683619] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:44.683846] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
>>> [2015-07-21 21:10:45.886807] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
>>> [2015-07-21 21:10:45.893059] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
>>> [2015-07-21 21:10:45.920434] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
>>> Host Unreachable, Check your connection with IPoIB
>>> [2015-07-21 21:10:45.925292] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
>>> Host Unreachable, Check your connection with IPoIB
>>>
>>> I have been using GlusterFS in production for around 3 years without
>>> any blocking problem, but the situation has been dire for more than 3
>>> weeks… Indeed, our production has been down for roughly 3.5 weeks
>>> (with many different problems, first with GlusterFS v3.5.3 and now
>>> with 3.7.2-3), and I need to get it running again…
>>>
>>> Thanks in advance,
>>> Geoffrey
>>> ------------------------------------------------------
>>> Geoffrey Letessier
>>> IT manager & systems engineer
>>> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
>>> Institut de Biologie Physico-Chimique
>>> 13, rue Pierre et Marie Curie - 75005 Paris
>>> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr
>>>
>>> On 21 Jul 2015, at 19:36, Soumya Koduri <skoduri at redhat.com> wrote:
>>>
>>>> From the following errors,
>>>>
>>>> [2015-07-21 14:36:30.495321] I [MSGID: 114020] [client.c:2118:notify] 0-vol_shared-client-0: parent translators are ready, attempting connect on transport
>>>> [2015-07-21 14:36:30.498989] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT 0 on socket 12, Protocole non disponible
>>>> [2015-07-21 14:36:30.499004] E [socket.c:3015:socket_connect] 0-vol_shared-client-0: Failed to set keep-alive: Protocole non disponible
>>>>
>>>> it looks like setting the TCP_USER_TIMEOUT value to 0 on the socket
>>>> failed with the error (IIUC) "Protocol not available".
>>>> Could you check whether 'network.ping-timeout' is set to zero for
>>>> that volume using 'gluster volume info'? From the code it looks like
>>>> 'TCP_USER_TIMEOUT' can take the value zero, so I am not sure why it
>>>> failed.
>>>>
>>>> Niels, any thoughts?
>>>>
>>>> Thanks,
>>>> Soumya
>>>>
>>>> On 07/21/2015 08:15 PM, Geoffrey Letessier wrote:
>>>>> [2015-07-21 14:36:30.495321] I [MSGID: 114020] [client.c:2118:notify]
>>>>> 0-vol_shared-client-0: parent translators are ready, attempting
>>>>> connect on transport
>>>>> [2015-07-21 14:36:30.498989] W [socket.c:923:__socket_keepalive]
>>>>> 0-socket: failed to set TCP_USER_TIMEOUT 0 on socket 12, Protocole
>>>>> non disponible
>>>>> [2015-07-21 14:36:30.499004] E [socket.c:3015:socket_connect]
>>>>> 0-vol_shared-client-0: Failed to set keep-alive: Protocole non
>>>>> disponible
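A quick way to act on Soumya's suggestion above: check whether
network.ping-timeout was overridden for the volume, and restore its
42-second default if it was set to 0 (volume name taken from this thread):

    gluster volume info vol_shared | grep -i ping-timeout
    # restore the default if it was set to 0:
    gluster volume set vol_shared network.ping-timeout 42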
Geoffrey Letessier
2015-Jul-22 07:17 UTC
[Gluster-users] Change transport-type on volume from tcp to rdma, tcp
Hi Rafi,

That's what I do. But I notice this kind of trouble particularly when I
mount my volumes manually. In addition, when I changed my transport-type
from tcp or rdma to tcp,rdma, I had to restart my volumes for the change to
take effect.

I wonder whether these troubles are due to the RDMA protocol… because
everything looks more stable with the TCP one.

Any other idea?

Thanks for your reply, and thanks in advance,
Geoffrey

------------------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at ibpc.fr

On 22 Jul 2015, at 07:33, Mohammed Rafi K C <rkavunga at redhat.com> wrote:

> Are you using a bash script to start/mount the volume? If so, add a sleep
> after the volume start and after the mount, to allow all the processes to
> start properly, because the RDMA protocol takes some time to initialize
> its resources.
>
> Regards,
> Rafi KC
>
> [...]
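For reference, a sketch of the transport-type change Geoffrey describes.
Per the Gluster documentation, the config.transport option only takes
effect on a stopped volume, which matches the restart he observed; the
volume name is taken from this thread and the sleep duration is a guess:

    gluster volume stop vol_workdir_amd
    gluster volume set vol_workdir_amd config.transport tcp,rdma
    gluster volume start vol_workdir_amd
    # as Rafi notes, give RDMA a moment to initialize before remounting:
    sleep 10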