Jeff White
2012-Jun-13 14:02 UTC
[Gluster-users] GlusterFS 3.3 span volume, op(RENAME(8)) fails, "Transport endpoint is not connected"
I recently upgraded my dev cluster to 3.3. To do this I copied the data out of the old volume onto a bare disk, wiped out everything about Gluster, installed the 3.3 packages, created a new volume (I wanted to change my brick layout), then copied the data back into the new volume. Previously everything worked fine, but now my users are complaining of random errors when compiling software.

I enabled debug logging for the clients and I see this:

x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
[2012-06-12 17:12:02.783526] D [mem-pool.c:457:mem_get] (-->/usr/lib64/libglusterfs.so.0(dict_unserialize+0x28d) [0x36e361413d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
[2012-06-12 17:12:02.783584] D [mem-pool.c:457:mem_get] (-->/usr/lib64/libglusterfs.so.0(dict_unserialize+0x28d) [0x36e361413d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
[2012-06-12 17:12:45.726083] D [client-handshake.c:184:client_start_ping] 0-vol_home-client-0: returning as transport is already disconnected OR there are no frames (0 || 0)
[2012-06-12 17:12:45.726154] D [client-handshake.c:184:client_start_ping] 0-vol_home-client-3: returning as transport is already disconnected OR there are no frames (0 || 0)
[2012-06-12 17:12:45.726171] D [client-handshake.c:184:client_start_ping] 0-vol_home-client-1: returning as transport is already disconnected OR there are no frames (0 || 0)
[2012-06-12 17:15:35.888437] E [rpc-clnt.c:208:call_bail] 0-vol_home-client-2: bailing out frame type(GlusterFS 3.1) op(RENAME(8)) xid = 0x2015421x sent = 2012-06-12 16:45:26.237621. timeout = 1800
[2012-06-12 17:15:35.888507] W [client3_1-fops.c:2385:client3_1_rename_cbk] 0-vol_home-client-2: remote operation failed: Transport endpoint is not connected
[2012-06-12 17:15:35.888529] W [dht-rename.c:478:dht_rename_cbk] 0-vol_home-dht: /sam/senthil/genboree/SupportingPkgs/gcc-3.4.6/x86_64-unknown-linux-gnu/32/libjava/java/net/SocketException.class.tmp: rename on vol_home-client-2 failed (Transport endpoint is not connected)
[2012-06-12 17:15:35.889803] W [fuse-bridge.c:1516:fuse_rename_cbk] 0-glusterfs-fuse: 2776710: /sam/senthil/genboree/SupportingPkgs/gcc-3.4.6/x86_64-unknown-linux-gnu/32/libjava/java/net/SocketException.class.tmp -> /sam/senthil/genboree/SupportingPkgs/gcc-3.4.6/x86_64-unknown-linux-gnu/32/libjava/java/net/SocketException.class => -1 (Transport endpoint is not connected)
[2012-06-12 17:15:35.890002] D [mem-pool.c:457:mem_get] (-->/usr/lib64/libglusterfs.so.0(dict_new+0xb) [0x36e3613d6b] (-->/usr/lib64/libglusterfs.so.0(get_new_dict_full+0x27) [0x36e3613c67] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
[2012-06-12 17:15:35.890167] D [mem-pool.c:457:mem_get] (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d) [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
[2012-06-12 17:15:35.890258] D [mem-pool.c:457:mem_get] (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d) [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
[2012-06-12 17:15:35.890311] D [mem-pool.c:457:mem_get] (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d) [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem
[2012-06-12 17:15:35.890363] D [mem-pool.c:457:mem_get] (-->/usr/lib64/glusterfs/3.3.0/xlator/performance/md-cache.so(mdc_load_reqs+0x3d) [0x2aaaac201a2d] (-->/usr/lib64/libglusterfs.so.0(dict_set+0x163) [0x36e3613ad3] (-->/usr/lib64/libglusterfs.so.0(mem_get0+0x1b) [0x36e364018b]))) 0-mem-pool: Mem pool is full. Callocing mem

...and so on, more of the same.

If I enable debug logging on the bricks I see thousands of these lines every minute and I'm forced to disable the logging:

[2012-06-12 15:32:45.760598] D [io-threads.c:268:iot_schedule] 0-vol_home-io-threads: LOOKUP scheduled as fast fop

Here's my config:

# gluster volume info
Volume Name: vol_home
Type: Distribute
Volume ID: 07ec60be-ec0c-4579-a675-069bb34c12ab
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: storage0-dev.cssd.pitt.edu:/brick/0
Brick2: storage1-dev.cssd.pitt.edu:/brick/2
Brick3: storage0-dev.cssd.pitt.edu:/brick/1
Brick4: storage1-dev.cssd.pitt.edu:/brick/3
Options Reconfigured:
diagnostics.brick-log-level: INFO
diagnostics.client-log-level: INFO
features.limit-usage: /home/cssd/jaw171:50GB,/cssd:200GB,/cssd/jaw171:75GB
nfs.rpc-auth-allow: 10.54.50.*,127.*
auth.allow: 10.54.50.*,127.*
performance.io-cache: off
cluster.min-free-disk: 5
performance.cache-size: 128000000
features.quota: on
nfs.disable: on

# rpm -qa | grep gluster
glusterfs-fuse-3.3.0-1.el6.x86_64
glusterfs-server-3.3.0-1.el6.x86_64
glusterfs-3.3.0-1.el6.x86_64

Name resolution is fine everywhere, every node can ping every other node by name, no firewalls are running anywhere, and there are no disk errors on the storage nodes.

Did the way I copied data out of one volume and back into another cause this (some xattr problem)? What else could be causing it? I'm looking to go to production with GlusterFS on a 242-node (soon to grow) HPC cluster at the end of this month.

Also, one of my co-workers improved upon an existing remote quota viewer written in Python. I'll post the code soon for those interested.

-- 
Jeff White - Linux/Unix Systems Engineer
University of Pittsburgh - CSSD
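
P.S. If it does turn out to be an xattr problem from the copy-back, I can check the bricks directly. Roughly what I have in mind, run on the storage nodes against the brick filesystems rather than through the FUSE mount (the directory is just an example):

# getfattr -d -m . -e hex /brick/0/sam
# getfattr -d -m . -e hex /brick/1/sam

Directories Gluster manages should carry trusted.gfid and trusted.glusterfs.dht, and the trusted.gfid value for a given directory should match on all four bricks; anything missing or mismatched there would point back at the copy-back.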
Anand Avati
2012-Jun-13 17:37 UTC
[Gluster-users] GlusterFS 3.3 span volume, op(RENAME(8)) fails, "Transport endpoint is not connected"
Can you get a process state dump of the brick process hosting the '0-vol_home-client-2' subvolume? That should give some clues about what happened to the missing rename call.
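
A rough sketch of how to grab it on 3.3 (the brick-to-subvolume mapping below is an assumption based on the order shown in 'gluster volume info', which would make vol_home-client-2 the storage0-dev.cssd.pitt.edu:/brick/1 brick; check the PID against what 'gluster volume status' reports):

# gluster volume status vol_home
# gluster volume statedump vol_home

Alternatively, send SIGUSR1 to just that one brick process (kill -USR1 <pid of the glusterfsd serving /brick/1>). The dump files should appear under /tmp on the brick server, or under whatever directory server.statedump-path is set to.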
Avati

On Wed, Jun 13, 2012 at 7:02 AM, Jeff White <jaw171 at pitt.edu> wrote:
> [original message quoted in full; snipped]