Manhong Dai
2008-Dec-01 16:54 UTC
[Gluster-users] two stability glitches after continuous file operations for a month
Hi,

After a month of file operations, which included copying 20 million small files and running about 20 thousand cluster jobs, I am satisfied overall except for two stability glitches.

1. A small portion (about 1%?) of jobs got the error "transport endpoint not connected", and their output files are incomplete. This error happened on random compute nodes, and it doesn't affect subsequent jobs on the same node. An example of the error message from glusterfsd is

2008-11-19 23:09:51 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (172.20.102.2:1022)

The error from glusterfs is either (looks to be caused by the brick)

2008-11-19 23:09:52 C [client-protocol.c:212:call_bail] muskie-brick: bailing transport
2008-11-19 23:09:52 E [client-protocol.c:4834:client_protocol_cleanup] muskie-brick: forced unwinding frame type(1) op(14) reply=@0x67e2150
2008-11-19 23:09:52 E [client-protocol.c:3254:client_write_cbk] muskie-brick: no proper reply from server, returning ENOTCONN
2008-11-19 23:09:56 E [write-behind.c:602:wb_writev] wb: delayed error : 107

or (caused by the namespace)

2008-11-28 20:47:53 C [client-protocol.c:212:call_bail] muskie-ns: bailing transport
2008-11-28 20:47:53 E [client-protocol.c:4834:client_protocol_cleanup] muskie-ns: forced unwinding frame type(1) op(40) reply=@0x1b447cc0
2008-11-28 20:47:53 E [client-protocol.c:4613:client_checksum_cbk] muskie-ns: no proper reply from server, returning ENOTCONN
2008-11-28 20:47:53 E [client-protocol.c:325:client_protocol_xfer] muskie-ns: transport_submit failed

2. Right now the 'glusterfs' process takes 1785M of virtual memory and 1500M of resident memory, according to top. I hope this is not a memory leak, or at least that there is a way to reduce memory usage without remounting.

If somebody can shed some light on these issues, I would appreciate it. Just let me know if you need more detailed information.

Best,
Manhong
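To see whether the bail-outs cluster on one volume (brick versus namespace), the client log can be tallied by volume name. The following is only a minimal sketch: the regular expression is inferred from the handful of log lines quoted above, not from any GlusterFS specification, and the helper names `parse_line` and `count_bails` are hypothetical. On Linux, error 107 in the write-behind line is ENOTCONN, the same "transport endpoint is not connected" condition the jobs reported.

```python
import errno
import os
import re
from collections import Counter

# Assumed layout, inferred from the quoted log lines only:
# "<date> <time> <level> [<file>:<line>:<function>] <volume>: <message>"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<volume>[^:]+): (?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_bails(lines):
    """Count call_bail events per volume name."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["func"] == "call_bail":
            counts[record["volume"]] += 1
    return counts


# Log lines copied from the message above.
SAMPLE = [
    "2008-11-19 23:09:52 C [client-protocol.c:212:call_bail] muskie-brick: bailing transport",
    "2008-11-19 23:09:52 E [client-protocol.c:3254:client_write_cbk] muskie-brick: no proper reply from server, returning ENOTCONN",
    "2008-11-19 23:09:56 E [write-behind.c:602:wb_writev] wb: delayed error : 107",
    "2008-11-28 20:47:53 C [client-protocol.c:212:call_bail] muskie-ns: bailing transport",
]

if __name__ == "__main__":
    print(dict(count_bails(SAMPLE)))
    # Decode the numeric error from the write-behind line (value is Linux-specific).
    print(errno.errorcode.get(107), "-", os.strerror(107))
```

Running the same tally over the full client log from each compute node would show whether one server accounts for most of the bail-outs.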
Raghavendra G
2008-Dec-02 09:19 UTC
[Gluster-users] two stability glitches after continuous file operations for a month
Hi,

Please find my comments inline.

On Mon, Dec 1, 2008 at 8:54 PM, Manhong Dai <daimh at umich.edu> wrote:

> Hi,
>
> After a month of file operations, which included copying 20 million
> small files and running about 20 thousand cluster jobs, I am satisfied
> overall except for two stability glitches.
>
> 1. A small portion (about 1%?) of jobs got the error "transport
> endpoint not connected", and their output files are incomplete. This
> error happened on random compute nodes, and it doesn't affect subsequent
> jobs on the same node. An example of the error message from glusterfsd is
> 2008-11-19 23:09:51 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (172.20.102.2:1022)
>
> The error from glusterfs is either (looks to be caused by the brick)
> 2008-11-19 23:09:52 C [client-protocol.c:212:call_bail] muskie-brick: bailing transport
> 2008-11-19 23:09:52 E [client-protocol.c:4834:client_protocol_cleanup] muskie-brick: forced unwinding frame type(1) op(14) reply=@0x67e2150
> 2008-11-19 23:09:52 E [client-protocol.c:3254:client_write_cbk] muskie-brick: no proper reply from server, returning ENOTCONN
> 2008-11-19 23:09:56 E [write-behind.c:602:wb_writev] wb: delayed error : 107
>
> or (caused by the namespace)
> 2008-11-28 20:47:53 C [client-protocol.c:212:call_bail] muskie-ns: bailing transport
> 2008-11-28 20:47:53 E [client-protocol.c:4834:client_protocol_cleanup] muskie-ns: forced unwinding frame type(1) op(40) reply=@0x1b447cc0
> 2008-11-28 20:47:53 E [client-protocol.c:4613:client_checksum_cbk] muskie-ns: no proper reply from server, returning ENOTCONN
> 2008-11-28 20:47:53 E [client-protocol.c:325:client_protocol_xfer] muskie-ns: transport_submit failed

What transport-timeout are you using? If the transport-timeout is small and the server is busy serving other requests, there is a good chance that operations are bailing out and producing the ENOTCONN errors. Are you using io-threads on the server side?

Can you send the configuration files?

> 2. Right now the 'glusterfs' process takes 1785M of virtual memory and
> 1500M of resident memory, according to top. I hope this is not a memory
> leak, or at least that there is a way to reduce memory usage without
> remounting.
>
> If somebody can shed some light on these issues, I would appreciate it.
> Just let me know if you need more detailed information.
>
> Best,
> Manhong

--
Raghavendra G
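On the memory question in point 2, a single reading from top cannot tell a leak from a cache that has stopped growing; repeated samples over time can. The sketch below is one possible way to collect them on Linux by reading `VmSize` and `VmRSS` from `/proc/<pid>/status`. The function names `parse_status`, `read_memory_kb`, and `sample` are hypothetical, and the demo samples the script's own process; in practice the PID of the glusterfs client process would be passed in instead.

```python
import os
import time


def parse_status(text):
    """Extract (VmSize, VmRSS) in kB from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # Lines look like "VmRSS:     1234 kB"; keep the integer part.
            fields[key] = int(rest.split()[0])
    return fields.get("VmSize"), fields.get("VmRSS")


def read_memory_kb(pid):
    """Read the current virtual and resident size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_status(handle.read())


def sample(pid, count, interval_s):
    """Collect `count` samples of (timestamp, VmSize kB, VmRSS kB)."""
    samples = []
    for i in range(count):
        samples.append((time.time(),) + read_memory_kb(pid))
        if i + 1 < count:
            time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    # Demo on this script's own PID; substitute the glusterfs PID in practice.
    for stamp, vm_size, vm_rss in sample(os.getpid(), count=3, interval_s=1.0):
        print(f"{stamp:.0f} VmSize={vm_size} kB VmRSS={vm_rss} kB")
```

If the resident size keeps climbing under a steady workload, that points toward a leak; if it levels off, it is more likely cached data.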