Richard Hesse
2008-Feb-12 03:23 UTC
[Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Started to play with FDS 1.1 for some dogfood testing. After running for 10-15 minutes, the server stopped responding to network requests and went silent. The process was running, the error log was updating with the ldbm event loop, but no socket requests were fulfilled. Checking the access log, I saw this:

[12/Feb/2008:01:47:58 +0000] conn=71108 op=-1 fd=79 closed error 107 (Transport endpoint is not connected) - Network file descriptor is not connected.
[12/Feb/2008:01:47:59 +0000] conn=71007 op=60 fd=69 closed - B4
[12/Feb/2008:01:48:00 +0000] conn=71003 op=48 fd=68 closed - B4
[12/Feb/2008:01:48:01 +0000] conn=71017 op=47 fd=72 closed - B4
[12/Feb/2008:01:48:06 +0000] conn=71102 op=2 fd=66 closed - B4
[12/Feb/2008:01:48:07 +0000] conn=71103 op=2 fd=70 closed - B4
[12/Feb/2008:01:48:07 +0000] conn=71040 op=10 fd=76 closed - B4

Any ideas or suggestions on how to approach troubleshooting this issue would be greatly appreciated.

Thanks.

-richard
Richard Megginson
2008-Feb-12 03:43 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
B4 means SLAPD_DISCONNECT_BER_FLUSH. This usually means the client has reset or closed the connection while the server was attempting to send a response.

http://www.redhat.com/docs/manuals/dir-server/cli/8.0/Configuration_Command_File_Reference-Access_Log_and_Connection_Code_Reference-Common_Connection_Codes.html

Do you have a firewall or some other network device?
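To see at a glance how the closures in an access log break down, the lines can be tallied by closure code. The following is a minimal sketch, not part of the directory server tooling: the function name `tally_closures` and the two regular expressions are assumptions written to match only the line shapes shown in the excerpt at the start of this thread.

```python
import re
from collections import Counter

# Sample lines copied from the access log excerpt at the start of this thread.
SAMPLE = """\
[12/Feb/2008:01:47:58 +0000] conn=71108 op=-1 fd=79 closed error 107 (Transport endpoint is not connected) - Network file descriptor is not connected.
[12/Feb/2008:01:47:59 +0000] conn=71007 op=60 fd=69 closed - B4
[12/Feb/2008:01:48:00 +0000] conn=71003 op=48 fd=68 closed - B4
[12/Feb/2008:01:48:01 +0000] conn=71017 op=47 fd=72 closed - B4
[12/Feb/2008:01:48:06 +0000] conn=71102 op=2 fd=66 closed - B4
[12/Feb/2008:01:48:07 +0000] conn=71103 op=2 fd=70 closed - B4
[12/Feb/2008:01:48:07 +0000] conn=71040 op=10 fd=76 closed - B4
"""

# Hypothetical patterns: they cover only the two line shapes seen above.
CLOSE_CODE = re.compile(r"conn=\d+ op=-?\d+ fd=\d+ closed - ([A-Z]\d+)\s*$")
CLOSE_ERRNO = re.compile(r"conn=\d+ op=-?\d+ fd=\d+ closed error (\d+) \(")


def tally_closures(text):
    """Count closures by connection code (e.g. B4) and by errno (e.g. 107)."""
    codes = Counter()
    errnos = Counter()
    for line in text.splitlines():
        match = CLOSE_CODE.search(line)
        if match:
            codes[match.group(1)] += 1
            continue
        match = CLOSE_ERRNO.search(line)
        if match:
            errnos[int(match.group(1))] += 1
    return codes, errnos


codes, errnos = tally_closures(SAMPLE)
print(dict(codes))   # {'B4': 6}
print(dict(errnos))  # {107: 1}
```

For a real log, the same function could be fed the contents of the instance's access file; other line formats would need additional patterns.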
Richard Hesse
2008-Feb-12 18:44 UTC
RE: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
There's a load balancer acting as the client to the DS (proxying client requests). I think that's a red herring though. Any search requests sent directly to the DS, bypassing the LB, would fail. I think I even tried requests locally from the server and they still failed. I can't be sure about that last statement, it was a long day.

What about the network file descriptor is not connected error?

Thanks.

-richard
Richard Megginson
2008-Feb-12 20:32 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Richard Hesse wrote:
> There's a load balancer acting as the client to the DS (proxying client requests). I think that's a red herring though. Any search requests sent directly to the DS, bypassing the LB, would fail. I think I even tried requests locally from the server and they still failed. I can't be sure about that last statement, it was a long day.

What are all of these closed connections from, e.g. conn=71007, conn=71003, etc.? Are they from the load balancer? I'm not really sure how to proceed to diagnose this from the directory server because events like these usually indicate something is happening at the TCP/IP layer. I would be really interested to see if you continued to have problems if you shut off the load balancer completely and just contacted the directory server via the loopback interface.

> What about the network file descriptor is not connected error?

It's similar to the B4: it means there was a problem with the connection to the client.
Richard Hesse
2008-Feb-14 19:21 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Actually, it ends up that debug logging was putting too much disk load on the server and the process fell behind/stopped servicing socket requests. Thanks for your help Richard.

-richard
Richard Hesse
2008-Feb-15 20:04 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Eh sorry about this but it appears that my original hunch was correct. The 1.1 DS instance did indeed hang again recently. I was able to check a localhost query and that failed, too. So the problem definitely appears to be a hang in the FDS code somewhere. The question is, how do I go about debugging this? Strace doesn't show much at all. Enabling debug trace logging kills the server. Any ideas?

Thanks.

-richard
Rich Megginson
2008-Feb-15 20:23 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
What sort of application(s) are you using to generate a load against the directory server?

What does logconv.pl /var/log/dirsrv/slapd-instance/access say?

If TRACE level logging is too expensive, you might try error log level 8 (Connection management):
http://directory.fedoraproject.org/wiki/FAQ#Troubleshooting
Richard Hesse
2008-Feb-15 20:38 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Thanks Richard, I'll give connection management a whirl.
Here's the log parser output (nice util btw):
----------- Access Log Output ------------
Restarts: 0
Total Connections: 4820
Peak Concurrent Connections: 19
Total Operations: 18017
Total Results: 18129
Overall Performance: 100.0%
Searches: 9960
Modifications: 6
Adds: 0
Deletes: 0
Mod RDNs: 0
Persistent Searches: 0
Internal Operations: 0
Entry Operations: 0
Extended Operations: 3224
Abandoned Requests: 0
Smart Referrals Received: 0
VLV Operations: 0
VLV Unindexed Searches: 0
SORT Operations: 0
SSL Connections: 1613
Entire Search Base Queries: 820
Unindexed Searches: 0
FDs Taken: 4828
FDs Returned: 4817
Highest FD Taken: 109
Broken Pipes: 0
Connections Reset By Peer: 0
Resource Unavailable: 17
- 17 (T1) Idle Timeout Exceeded
Binds: 4827
Unbinds: 65
LDAP v2 Binds: 0
LDAP v3 Binds: 4827
SSL Client Binds: 0
Failed SSL Client Binds: 0
SASL Binds: 0
Directory Manager Binds: 0
Anonymous Binds: 4813
Other Binds: 14
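A couple of the counters above are easier to read as differences. The sketch below parses "Name: value" lines like the ones in this summary and subtracts the paired counters; the function name `parse_summary` is an assumption for illustration, and only a few lines of the output are copied in. What the gaps mean (for instance, connections still open when the log ended) is not established by the numbers alone.

```python
# A few counters copied from the logconv.pl output above.
SUMMARY = """\
Total Connections: 4820
FDs Taken: 4828
FDs Returned: 4817
Binds: 4827
Unbinds: 65
"""


def parse_summary(text):
    """Turn 'Name: 123' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


stats = parse_summary(SUMMARY)
fds_outstanding = stats["FDs Taken"] - stats["FDs Returned"]
binds_without_unbind = stats["Binds"] - stats["Unbinds"]
print(fds_outstanding)       # 11
print(binds_without_unbind)  # 4762
```

The second figure only says that most binds were never followed by an explicit unbind in this log; it does not by itself identify which client behaves that way.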
Rich Megginson
2008-Feb-15 20:53 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Richard Hesse wrote:
> Thanks Richard, I'll give connection management a whirl.

What is the application which is generating this load?
Richard Hesse
2008-Feb-15 21:59 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
nsswitch posix users/groups, ssh, sudo, puppet (config management), and internally written applications.

-richard
Rich Megginson
2008-Feb-15 22:11 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Richard Hesse wrote:
> nsswitch posix users/groups,

Are you using nscd?
Richard Hesse
2008-Feb-15 22:50 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Yes, every host (except the ldap hosts) runs nscd. The ldap servers are not configured to use directory data for anything.

-richard
Rich Megginson
2008-Feb-19 18:23 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
I just don't know. I've not seen this before. I suppose you could try checking your kernel TCP/IP settings, and increasing the number of file descriptors used:
http://directory.fedoraproject.org/wiki/Performance_Tuning
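As a quick way to look at the file descriptor limits mentioned here, the following sketch reads the soft and hard RLIMIT_NOFILE values with Python's standard `resource` module (Unix only). Note the assumption: it reports the limits of the Python process itself, which reflect the environment it was started from, not necessarily those of a running ns-slapd. The helper names are made up for illustration.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = nofile_limits()
print("soft limit:", describe(soft))
print("hard limit:", describe(hard))
```

Comparing the soft limit with the "Highest FD Taken" figure from logconv.pl gives a rough idea of how close the server came to running out of descriptors.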
Richard Hesse
2008-Feb-19 23:02 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Not much new to report. The server hung again and the only thing in the error log with connection tracing is this:

[18/Feb/2008:13:14:03 +0000] - PR_Write(41818752) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
[18/Feb/2008:13:14:03 +0000] - ber_flush failed, error 104 (Connection reset by peer)

Which doesn't look like much.

As for network tuning, it's already been done. Max descriptors is set to 32768.

Are there any gdb commands I can run while the server is in a hung state? I'm going to try running strace while the process is working, and hope for a hang. Maybe that will give us some more info.

-richard
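The numeric errors in these logs (107 in the access log, 104 in the error log) are operating system errno values. A minimal sketch for decoding them with Python's standard `errno` and `os` modules follows; the function name `describe_errno` is an assumption. The numeric values are platform specific, so the script looks the constants up by name rather than hard-coding 104 or 107; on the Linux host in this thread they correspond to ECONNRESET and ENOTCONN.

```python
import errno
import os


def describe_errno(code):
    """Return 'number NAME: message' for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} {name}: {os.strerror(code)}"


# Look the constants up by name so the output is right on any platform.
for name in ("ENOTCONN", "ECONNRESET", "EPIPE"):
    print(describe_errno(getattr(errno, name)))
```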
Rich Megginson
2008-Feb-20 00:04 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Richard Hesse wrote:
> [18/Feb/2008:13:14:03 +0000] - PR_Write(41818752) Netscape Portable Runtime error -5961 (TCP connection reset by peer.)
> [18/Feb/2008:13:14:03 +0000] - ber_flush failed, error 104 (Connection reset by peer)
>
> Which doesn't look like much.

Well, it tells me that the server was attempting to write to a socket, and got an error. -5961 is PR_CONNECT_RESET_ERROR, which can occur if the system call returns either EPIPE or ECONNRESET. And error 104 is indeed ECONNRESET:

/usr/include/asm-generic/errno.h:#define ECONNRESET 104 /* Connection reset by peer */

AFAICT, this can happen if the client shuts down the socket (for any number of reasons) but the server is still attempting to send data. In this case, the client will respond with a TCP RST. I'm not sure how or why this could happen. I'm open to other causes for ECONNRESET. What would be really, really interesting is if we could narrow this down to a particular client application and run ethereal on the connection.

Are you using SSL?

> Are there any gdb commands I can run while the server is in a hung state?

Sure. Whatever the cause of the ECONNRESET, it should not cause the server to hang, and it would be interesting to find out what it's doing. You'll have to install the fedora-ds-base-debuginfo package. Attach to the process:

gdb /usr/sbin/ns-slapd <pid of process>

Then, dump the thread stacks:

(gdb) thread apply all bt

If you want the output to go to a file, redirect gdb logging to a file first, before doing the thread apply. Set the file name before turning logging on, otherwise the output goes to the default gdb.txt, e.g.

(gdb) set logging file stack.txt
(gdb) set logging on
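When the server is hung there may not be much time to type at an interactive prompt, so the same stack dump can be scripted. The sketch below only builds the command line; it assumes a gdb that supports the `-batch` and `-ex` options, and the function name, default binary path, and output file name are illustrative choices taken from this message. It sets the log file before turning logging on so that the file name takes effect.

```python
import shlex


def gdb_backtrace_command(pid, binary="/usr/sbin/ns-slapd", logfile="stack.txt"):
    """Build an argv list that attaches gdb to pid and dumps all thread stacks."""
    return [
        "gdb", "-batch",
        "-ex", f"set logging file {logfile}",  # choose the output file first
        "-ex", "set logging on",               # then start logging to it
        "-ex", "thread apply all bt",          # backtrace of every thread
        binary, str(pid),
    ]


# 12345 is a placeholder pid; substitute the pid of the hung ns-slapd.
print(shlex.join(gdb_backtrace_command(12345)))
```

The resulting list can be passed to `subprocess.run` by someone with permission to attach to the process; the debuginfo package is still needed for readable stacks.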
Richard Hesse
2008-Feb-20 23:17 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Yeah, we're using SSL and TLS so ethereal/tcpdump isn't going to yield much info. The process hung again and strace didn't provide too much information other than this:

futex(0x20b9260, FUTEX_WAIT, 2, NULL)

Would that give you a place to start looking?

-richard

On 2/19/08 4:04 PM, "Rich Megginson" <rmeggins@redhat.com> wrote:

> Richard Hesse wrote:
>> Not much new to report. The server hung again and the only thing in the
>> error log with connection tracing is this:
>>
>> [18/Feb/2008:13:14:03 +0000] - PR_Write(41818752) Netscape Portable Runtime
>> error -5961 (TCP connection reset by peer.)
>> [18/Feb/2008:13:14:03 +0000] - ber_flush failed, error 104 (Connection reset
>> by peer)
>>
>> Which doesn't look like much.
> Well, it tells me that the server was attempting to write to a socket,
> and got an error. -5961 is PR_CONNECT_RESET_ERROR which can occur if
> the system call returns either EPIPE or ECONNRESET. And error 104 is
> indeed ECONNRESET.
> /usr/include/asm-generic/errno.h:#define ECONNRESET 104
> /* Connection reset by peer */
>
> AFAICT, this can happen if the client shuts down the socket (for any
> number of reasons) but the server is still attempting to send data. In
> this case, the client will respond with a TCP RST. I'm not sure how or
> why this could happen. I'm open to other causes for ECONNRESET.
> What would be really, really interesting is if we could narrow this down
> to a particular client application and run ethereal on the connection.
>
> Are you using SSL?
>> As for network tuning, it's already been done.
>>
>> Max descriptors is set to 32768.
>>
>> Are there any gdb commands I can run while the server is in a hung state?
>>
> Sure. For whatever the cause of the ECONNRESET, it should not cause the
> server to hang, and it would be interesting to find out what it's
> doing. You'll have to install the fedora-ds-base-debuginfo package.
> Attach to the process - gdb /usr/sbin/ns-slapd <pid of process>
> Then, dump the thread stacks -
>
> (gdb) thread apply all bt
>
> If you want the output to go to a file, redirect gdb logging to a file
> first before doing the thread apply e.g.
>
> (gdb) set logging on
> (gdb) set logging file stack.txt
>
>> I'm going to try running strace while the process is working, and hope for a
>> hang. Maybe that will give us some more info.
>>
>> -richard
>>
>> On 2/19/08 10:23 AM, "Rich Megginson" <rmeggins@redhat.com> wrote:
>>
>>> Richard Hesse wrote:
>>>> Yes, every host (except the ldap hosts) runs nscd. The ldap servers are not
>>>> configured to use directory data for anything.
>>> I just don't know. I've not seen this before. I suppose you could try
>>> checking your kernel TCP/IP settings, and increasing the number of file
>>> descriptors used -
>>> http://directory.fedoraproject.org/wiki/Performance_Tuning
>>>
>>>> -richard
>>>>
>>>> On 2/15/08 2:11 PM, "Rich Megginson" <rmeggins@redhat.com> wrote:
>>>>
>>>>> Richard Hesse wrote:
>>>>>> nsswitch posix users/groups,
>>>>> Are you using nscd?
>>>>>> ssh, sudo, puppet (config management), and
>>>>>> internally written applications.
>>>>>>
>>>>>> -richard
>>>>>>
>>>>>> On 2/15/08 12:53 PM, "Rich Megginson" <rmeggins@redhat.com> wrote:
>>>>>>
>>>>>>> What is the application which is generating this load?
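The gdb steps quoted above (attach to ns-slapd, then "thread apply all bt") can also be run non-interactively, which may be handy when the hang happens unattended. The following is only a sketch under stated assumptions: the helper names gdb_backtrace_argv and dump_thread_stacks and the default output name stack.txt are made up for illustration, and it redirects gdb's standard output to a file rather than using "set logging". It relies on gdb's standard -batch, -ex and -p options, and the debuginfo package mentioned in the thread would still be needed for readable stacks.

```python
import subprocess
from pathlib import Path
from typing import List


def gdb_backtrace_argv(pid: int) -> List[str]:
    """Build a batch-mode gdb command line that prints every thread's stack."""
    return [
        "gdb",
        "-batch",                      # exit once the -ex commands have run
        "-ex", "set pagination off",   # never stop at a "--More--" prompt
        "-ex", "thread apply all bt",  # the command suggested in the thread
        "-p", str(pid),                # attach to the running process
    ]


def dump_thread_stacks(pid: int, out_path: str = "stack.txt") -> Path:
    """Attach to a live process and save all thread backtraces to out_path."""
    out = Path(out_path)
    with out.open("w") as fh:
        # check=False: gdb can exit non-zero (e.g. missing permissions) and
        # we still want whatever diagnostics it wrote to the file.
        subprocess.run(gdb_backtrace_argv(pid), stdout=fh,
                       stderr=subprocess.STDOUT, check=False)
    return out


if __name__ == "__main__":
    # 1234 is a placeholder pid; substitute the pid of the hung ns-slapd.
    print(" ".join(gdb_backtrace_argv(1234)))
```

If the interactive "set logging" route is used instead, it is probably safer to give "set logging file stack.txt" before "set logging on", since gdb otherwise starts writing to its default gdb.txt.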
Rich Megginson
2008-Feb-20 23:39 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Richard Hesse wrote:
> Yeah, we're using SSL and TLS so ethereal/tcpdump isn't going to yield much
> info.

It would give us the TCP/IP protocol data, so we could see what clients and servers are sending the FIN and RST. It's not so much the LDAP data I care about, although ssltap might be useful for that.

> The process hung again and strace didn't provide too much information
> other than this:
>
> futex(0x20b9260, FUTEX_WAIT, 2, NULL)
>
> Would that give you a place to start looking?

That does suggest a possible deadlock.

> -richard
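Since the point above is that the TCP teardown is visible even when the LDAP payload is encrypted, a capture can be limited to just the FIN and RST segments. This is a rough sketch, not something taken from the thread: the function name fin_rst_capture_argv, the output file name, and the choice of port 636 (the conventional LDAPS port; 389 would apply for StartTLS) are all assumptions, and the "any" pseudo-interface is Linux-specific. The filter expression uses the documented pcap-filter tcpflags syntax.

```python
from typing import List


def fin_rst_capture_argv(port: int = 636, iface: str = "any",
                         pcap_path: str = "ldap-teardown.pcap") -> List[str]:
    """Build a tcpdump command that records only FIN/RST segments on a port."""
    # Match TCP segments to or from the port with the FIN or RST flag set.
    capture_filter = (
        f"tcp port {port} and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)"
    )
    return [
        "tcpdump",
        "-i", iface,      # interface to listen on
        "-nn",            # do not resolve host names or port names
        "-w", pcap_path,  # write raw packets for later inspection
        capture_filter,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; tcpdump normally needs root.
    print(" ".join(fin_rst_capture_argv()))
```

The resulting file shows which side sent each FIN or RST and when, which is the information being asked for here; the encrypted payload is not needed for that.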
Rich Megginson
2008-Feb-26 03:31 UTC
Re: [Fedora-directory-users] FDS 1.1 Transport endpoint is not connected
Richard Hesse wrote:
> Yeah, we're using SSL and TLS so ethereal/tcpdump isn't going to yield much
> info. The process hung again and strace didn't provide too much information
> other than this:
>
> futex(0x20b9260, FUTEX_WAIT, 2, NULL)
>
> Would that give you a place to start looking?

Try logconv.pl -V /var/log/dirsrv/slapd-instancename/access

> -richard
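For a quick look at how connections are being closed before running the full logconv.pl report, the "closed" lines of the access log can be tallied directly. This is a minimal sketch and in no way a replacement for logconv.pl: the regular expression is an assumption based only on the access log lines posted at the start of this thread, and the names CLOSE_RE, count_close_reasons and SAMPLE are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   [..] conn=71007 op=60 fd=69 closed - B4
#   [..] conn=71108 op=-1 fd=79 closed error 107 (Transport ...) - Network ...
CLOSE_RE = re.compile(
    r"conn=(?P<conn>\d+) op=(?P<op>-?\d+) fd=(?P<fd>\d+) closed"
    r"(?: error (?P<errno>\d+) \((?P<errtext>[^)]*)\))? - (?P<reason>.+)$"
)


def count_close_reasons(lines: Iterable[str]) -> Counter:
    """Count closed connections by errno (when logged) or by close code."""
    counts: Counter = Counter()
    for line in lines:
        match = CLOSE_RE.search(line.rstrip())
        if match is None:
            continue  # not a "closed" line
        if match["errno"] is not None:
            counts[f"errno {match['errno']}"] += 1
        else:
            counts[match["reason"]] += 1
    return counts


# Three lines copied from the first post in this thread.
SAMPLE = [
    "[12/Feb/2008:01:47:58 +0000] conn=71108 op=-1 fd=79 closed error 107 "
    "(Transport endpoint is not connected) - Network file descriptor is not "
    "connected.",
    "[12/Feb/2008:01:47:59 +0000] conn=71007 op=60 fd=69 closed - B4",
    "[12/Feb/2008:01:48:00 +0000] conn=71003 op=48 fd=68 closed - B4",
]

if __name__ == "__main__":
    for reason, count in count_close_reasons(SAMPLE).most_common():
        print(f"{count:6d}  {reason}")
```

To run it against a real log, pass an open file object for the access log path given above instead of SAMPLE.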