thr3ads.net - Gluster users - [Gluster-users] two stability glitches after continuous file operations for a month [Dec 2008]

Manhong Dai

2008-Dec-01 16:54 UTC

[Gluster-users] two stability glitches after continuous file operations for a month

Hi,


	After a month of file operations, which included copying 20 million
small files and running about 20 thousand cluster jobs, I am overall
satisfied except for two stability glitches.


1. A small portion (about 1%?) of jobs failed with the error "transport
endpoint not connected", leaving the output file incomplete. The error
occurs on random compute nodes and does not affect subsequent jobs on
the same node. An example error message from glusterfsd is:
2008-11-19 23:09:51 E [protocol.c:271:gf_block_unserialize_transport]
server: EOF from peer (172.20.102.2:1022)

The glusterfs client logs either this (apparently caused by the brick):
2008-11-19 23:09:52 C [client-protocol.c:212:call_bail] muskie-brick:
bailing transport
2008-11-19 23:09:52 E [client-protocol.c:4834:client_protocol_cleanup]
muskie-brick: forced unwinding frame type(1) op(14) reply=@0x67e2150
2008-11-19 23:09:52 E [client-protocol.c:3254:client_write_cbk]
muskie-brick: no proper reply from server, returning ENOTCONN
2008-11-19 23:09:56 E [write-behind.c:602:wb_writev] wb: delayed error :
107

or this (caused by the namespace):
2008-11-28 20:47:53 C [client-protocol.c:212:call_bail] muskie-ns:
bailing transport
2008-11-28 20:47:53 E [client-protocol.c:4834:client_protocol_cleanup]
muskie-ns: forced unwinding frame type(1) op(40) reply=@0x1b447cc0
2008-11-28 20:47:53 E [client-protocol.c:4613:client_checksum_cbk]
muskie-ns: no proper reply from server, returning ENOTCONN
2008-11-28 20:47:53 E [client-protocol.c:325:client_protocol_xfer]
muskie-ns: transport_submit failed
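
To see which subvolume the bail-outs cluster on, the client log can be tallied with a short script. This is only a sketch: it assumes the record layout visible in the excerpts above (timestamp, one-letter level, `[file:line:function]`, subvolume name, message) and rejoins lines that were wrapped in the mail. The names `parse_records` and `count_bails` are made up for illustration and are not part of GlusterFS.

```python
import re
from collections import Counter

# Record header as it appears in the excerpts quoted in this thread, e.g.
#   2008-11-19 23:09:52 C [client-protocol.c:212:call_bail] muskie-brick: bailing transport
LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<xlator>[^:\s]+):\s*(?P<msg>.*)"
)
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} ")


def parse_records(text):
    """Yield one dict per log record, rejoining wrapped continuation lines."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TS_RE.match(line) or not records:
            records.append(line)          # a new record starts with a timestamp
        else:
            records[-1] += " " + line     # continuation of the previous record
    for rec in records:
        m = LOG_RE.match(rec)
        if m:
            yield m.groupdict()


def count_bails(text):
    """Count call_bail records per subvolume (hypothetical helper)."""
    return Counter(
        r["xlator"] for r in parse_records(text) if r["func"] == "call_bail"
    )


SAMPLE = """\
2008-11-19 23:09:52 C [client-protocol.c:212:call_bail] muskie-brick:
bailing transport
2008-11-19 23:09:56 E [write-behind.c:602:wb_writev] wb: delayed error :
107
2008-11-28 20:47:53 C [client-protocol.c:212:call_bail] muskie-ns:
bailing transport
"""

if __name__ == "__main__":
    for name, n in count_bails(SAMPLE).items():
        print(name, n)
```

Run against a full client log, the per-subvolume counts would show whether the brick or the namespace volume bails more often.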



2. Right now the 'glusterfs' process uses 1785M of virtual memory and
1500 RES (resident) memory, according to top. I hope this is not a memory
leak, or at least that there is a way to reduce memory usage without
remounting.
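
One way to tell a leak from cache growth is to sample the process's memory over time rather than read a single value from top. The sketch below assumes a Linux client, where `/proc/<pid>/status` exposes `VmSize` and `VmRSS` in kB; the helper names are hypothetical. Roughly speaking, usage that levels off under a steady workload points to caching, while growth that never stops is more suspicious.

```python
import re
import time


def parse_vm_status(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        m = re.match(r"(VmSize|VmRSS):\s+(\d+)\s+kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def read_vm_status(pid):
    """Read the current memory figures for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vm_status(fh.read())


def sample(pid, interval_s=600, count=6):
    """Print a few timestamped samples; pid is that of the glusterfs client."""
    for _ in range(count):
        vm = read_vm_status(pid)
        print(time.strftime("%Y-%m-%d %H:%M:%S"), vm.get("VmSize"), vm.get("VmRSS"))
        time.sleep(interval_s)
```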



If somebody can shed some light on these issues, I would appreciate it.
Just let me know if you need more detailed information.


Best,
Manhong

Raghavendra G

2008-Dec-02 09:19 UTC

[Gluster-users] two stability glitches after continuous file operations for a month

Hi,

Please find the comments inlined.

On Mon, Dec 1, 2008 at 8:54 PM, Manhong Dai <daimh at umich.edu> wrote:
> Hi,
>
>
>        After a month of file operations, which included copying 20
> million small files and running about 20 thousand cluster jobs, I am
> overall satisfied except for two stability glitches.
>
>
> 1. A small portion (about 1%?) of jobs failed with the error "transport
> endpoint not connected", leaving the output file incomplete. The error
> occurs on random compute nodes and does not affect subsequent jobs on
> the same node. An example error message from glusterfsd is:
> 2008-11-19 23:09:51 E [protocol.c:271:gf_block_unserialize_transport]
> server: EOF from peer (172.20.102.2:1022)
>
> The glusterfs client logs either this (apparently caused by the brick):
> 2008-11-19 23:09:52 C [client-protocol.c:212:call_bail] muskie-brick:
> bailing transport
> 2008-11-19 23:09:52 E [client-protocol.c:4834:client_protocol_cleanup]
> muskie-brick: forced unwinding frame type(1) op(14) reply=@0x67e2150
> 2008-11-19 23:09:52 E [client-protocol.c:3254:client_write_cbk]
> muskie-brick: no proper reply from server, returning ENOTCONN
> 2008-11-19 23:09:56 E [write-behind.c:602:wb_writev] wb: delayed error :
> 107
>
> or this (caused by the namespace):
> 2008-11-28 20:47:53 C [client-protocol.c:212:call_bail] muskie-ns:
> bailing transport
> 2008-11-28 20:47:53 E [client-protocol.c:4834:client_protocol_cleanup]
> muskie-ns: forced unwinding frame type(1) op(40) reply=@0x1b447cc0
> 2008-11-28 20:47:53 E [client-protocol.c:4613:client_checksum_cbk]
> muskie-ns: no proper reply from server, returning ENOTCONN
> 2008-11-28 20:47:53 E [client-protocol.c:325:client_protocol_xfer]
> muskie-ns: transport_submit failed
>

What transport-timeout are you using? If the transport-timeout is small
and the server is busy serving other requests, there is a good chance
that operations are bailing out and resulting in ENOTCONN errors.

Are you using io-threads on the server side? Can you send the
configuration files?

>
> 2. Right now the 'glusterfs' process uses 1785M of virtual memory and
> 1500 RES (resident) memory, according to top. I hope this is not a
> memory leak, or at least that there is a way to reduce memory usage
> without remounting.
>
>
>
> If somebody can shed some light on these issues, I would appreciate it.
> Just let me know if you need more detailed information.
>
>
> Best,
> Manhong


-- 
Raghavendra G
