Benjamin Smith
2009-Jan-20 00:27 UTC
[Gluster-users] Failure: "Transport endpoint is not connected" (but it is!)
Late last week, I rolled out GlusterFS on our production cluster. The config is very simple: two active servers that are also clients to each other. Usage is a fairly low-volume distribution of file settings for an application cluster; the files are updated perhaps a few times per day and read constantly (pretty much every web page hit). Here are the numbers:

OS: CentOS 4, Linux 2.6.9-78.0.13.ELsmp
HW: Multicore Opteron, x86_64, 4 GB ECC RAM, SCSI, software RAID 1
Transport: GB Ethernet
Fuse: 2.7.4-1 el4
dkms-fuse: 2.7.4-1
GlusterFS: 1.3.12 (built as RPM from tarball)
Config: (at the bottom of this email)

Got a complaint today: "Servers down!" When I did a "df" to see what was going on, I got a "Transport endpoint is not connected" message (or similar) next to the GlusterFS client partition. Yet in all cases I could ping/connect to the "other" system, and both DNS servers were working fine. More interested in restoring service than forensics, I did the following:

1) Shut down the gluster client and started it back up. Result? The df command worked as expected, but the files still could not be read.
2) Shut down the gluster client, then the gluster server, and restarted them in reverse order. Everything was then back up instantly.

glusterfs.log has about 3.5 million (no kidding!) entries; a small sample is below. The only entries in glusterfsd.log are those from my resetting things. =/

Any idea what causes this?

// GLUSTERFS.LOG
2009-01-19 14:12:17 E [client-protocol.c:4430:client_lookup_cbk] remote2: no proper reply from server, returning ENOTCONN
2009-01-19 14:12:17 W [client-protocol.c:332:client_protocol_xfer] remote2: not connected at the moment to submit frame type(1) op(34)
2009-01-19 14:12:17 E [client-protocol.c:4430:client_lookup_cbk] remote2: no proper reply from server, returning ENOTCONN
2009-01-19 14:12:17 W [client-protocol.c:332:client_protocol_xfer] remote2: not connected at the moment to submit frame type(1) op(34)
2009-01-19 14:12:17 E [client-protocol.c:4430:client_lookup_cbk] remote2: no proper reply from server, returning ENOTCONN
2009-01-19 14:12:17 W [client-protocol.c:332:client_protocol_xfer] remote2: not connected at the moment to submit frame type(1) op(34)
2009-01-19 14:12:17 E [client-protocol.c:4430:client_lookup_cbk] remote2: no proper reply from server, returning ENOTCONN
2009-01-19 14:12:23 W [client-protocol.c:332:client_protocol_xfer] remote2: not connected at the moment to submit frame type(1) op(34)
2009-01-19 14:12:23 E [client-protocol.c:4430:client_lookup_cbk] remote2: no proper reply from server, returning ENOTCONN
2009-01-19 14:12:23 W [client-protocol.c:332:client_protocol_xfer] remote2: not connected at the moment to submit frame type(1) op(34)
2009-01-19 14:12:23 E [client-protocol.c:4430:client_lookup_cbk] remote2: no proper reply from server, returning ENOTCONN

-- SERVER FILE --

volume brick
  type storage/posix
  option directory /home/uroot/home/cworks/.data
end-volume

volume server
  type protocol/server
  subvolumes brick
  option transport-type tcp/server
  option auth.ip.brick.allow 192.168.254.*
end-volume

-- CLIENT FILE --

volume remote1
  type protocol/client
  option transport-type tcp/client
  option remote-host glusterfs1.spfs
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp/client
  option remote-host glusterfs2.spfs
  option remote-subvolume brick
end-volume

volume mirror0
  type cluster/afr
  subvolumes remote1 remote2
end-volume
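P.S. For anyone wanting to reproduce my recovery steps, here is roughly what step 2 looks like as commands - a sketch only; I'm assuming the usual 1.3-style layout with volfiles under /etc/glusterfs and a mount point of /mnt/glusterfs, so substitute your own paths:

  # On the server: stop glusterfsd, then restart it with the server volfile
  killall glusterfsd
  glusterfsd -f /etc/glusterfs/glusterfs-server.vol

  # On the client: drop the FUSE mount, then remount with the client volfile
  umount /mnt/glusterfs
  glusterfs -f /etc/glusterfs/glusterfs-client.vol /mnt/glusterfs

  # Confirm the mount answers again
  df -h /mnt/glusterfs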
Anand Avati
2009-Jan-20 01:05 UTC
[Gluster-users] Failure: "Transport endpoint is not connected" (but it is!)
Do you have a coredump on the server? Was glusterfsd running on the server at all?

avati

On Tue, Jan 20, 2009 at 5:57 AM, Benjamin Smith <lists at benjamindsmith.com> wrote:
> Late last week, I rolled out GlusterFS on our production cluster. Config is
> very simple, two active servers that are also clients to each other. Usage is
> for a fairly low-volume distribution of file settings for an application
> cluster that are updated perhaps a few times per day and read constantly.
> (pretty much every web page hit) Here are the numbers:
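If it happens again, something along these lines will capture both - a rough sketch, since the glusterfsd binary path and the core file name vary by install, so treat those as placeholders:

  # 1. Is glusterfsd running, and in what state? ([g] keeps grep out of its own output)
  ps aux | grep '[g]lusterfsd'

  # 2. If it crashed and left a core file, pull a batch-mode backtrace:
  gdb -batch -ex 'bt' /usr/sbin/glusterfsd core.12345

Core dumps need to be enabled ahead of time ('ulimit -c unlimited' in the shell that starts glusterfsd), otherwise there will be no core file to inspect.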
Benjamin Smith
2009-Jan-21 00:34 UTC
[Gluster-users] Failure: "Transport endpoint is not connected" (but it is!)
Haven't seen the "Transport endpoint" problem again, yet. Instead, a different problem surfaced this morning. The servers began to "hang," taking 60 seconds or more to return a read. lsof showed many open files in the GlusterFS partition, all being read. A traffic monitor showed extremely high-volume data flow (essentially the full 1 Gb link) between the primary webserver and its glusterfs server twin.

Shutting down the webserver, glusterfs as client, and glusterfs as server, then restarting the whole stack from server to client to apache, resulted in a system that was responsive - for a while. The reads were nearly all of the same few dozen files that I want to have replicated on GlusterFS. Based on what I've seen, I guess that:

1) GlusterFS does some kind of coherency check at every file read.
2) GlusterFS processes these coherency checks serially.
3) The coherency checking was backing up.

Am I out in left field here? Is there something terrible and fundamental that I'm missing, or is GlusterFS + Ethernet + stock FUSE + basic config just not going to do all that well with medium-to-large amounts of reads of a few hundred small files? (say, 50/second)

I ended up rolling back GlusterFS and going back to a single, local file system, but I would like to move forward on this...

On Monday 19 January 2009 11:59:05 pm you wrote:
> 1. whether glusterfsd is running on the server or not, with the
>    process state (from ps) if running
> 2. backtrace of coredump using gdb if it has crashed.
>
> we can figure the next step only after having one of the above two
>
> avati
>
> On Tue, Jan 20, 2009 at 1:24 PM, Benjamin Smith
> <lists at benjamindsmith.com> wrote:
> > On Monday 19 January 2009 05:05:20 pm you wrote:
> >> Do you have a coredump on the server? was glusterfsd running on the
> >> server at all?
> >
> > No.
> >
> > If it should happen again, what should I do to provide you with what you
> > need?
> >
> > -Ben
> >
> >> <lists at benjamindsmith.com> wrote:
> >> > Late last week, I rolled out GlusterFS on our production cluster.
> >> > Config is very simple, two active servers that are also clients to
> >> > each other. Usage is for a fairly low-volume distribution of file
> >> > settings for an application cluster that are updated perhaps a few
> >> > times per day and read constantly. (pretty much every web page hit)
> >> > Here are the numbers:
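P.S. If I do revisit this, the first thing I'll test is client-side caching, which would address guesses (1)-(3) directly: if AFR is revalidating the replicas on every lookup/open, caching the hot files on the client should collapse most of those round trips. A sketch of what I have in mind, appended to the client file above and layered over mirror0 - assuming my 1.3.12 build ships the performance/io-cache translator, and with the sizes and the one-second staleness window being my own guesses to tune:

  volume iocache
    type performance/io-cache
    subvolumes mirror0
    # unit of cached data
    option page-size 128KB
    # total cache per mount
    option cache-size 64MB
    # re-validate cached data against the servers after 1 second
    option force-revalidate-timeout 1
  end-volume

With iocache defined last it becomes the root of the mount (or it can be selected explicitly with --volume-name iocache), so repeated reads of the same few dozen files would be served from cache for up to a second apiece instead of triggering a network round trip each, while writes still flow through the AFR mirror.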