thr3ads.net - freebsd stable - 8.0-RC1 NFS client timeout issue [Oct 2009]

If this information is useful, please help other people find it:
Share via:

Olaf Seibert

2009-Oct-27 16:52 UTC

8.0-RC1 NFS client timeout issue

I see an annoying behaviour with NFS over TCP. It happens both with nfs
and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
some Linux or perhaps Solaris, I'm not entirely sure.

After trying to find something in packet traces, I think I have found
something.

The scenario seems to be as follows. Sorry for the width of the lines.


No.     Time        Source                Destination           Protocol Info
   2296 2992.216855 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
LOOKUP Call (Reply In 2297), DH:0x3819da36/w
   2297 2992.217107 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
LOOKUP Reply (Call In 2296) Error:NFS3ERR_NOENT
   2298 2992.217141 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
LOOKUP Call (Reply In 2299), DH:0x170cb16a/bin
   2299 2992.217334 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
LOOKUP Reply (Call In 2298), FH:0x61b8eb12
   2300 2992.217361 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
ACCESS Call (Reply In 2301), FH:0x61b8eb12
   2301 2992.217582 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
ACCESS Reply (Call In 2300)
   2302 2992.217605 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
LOOKUP Call (Reply In 2303), DH:0x61b8eb12/w
   2303 2992.217860 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
LOOKUP Reply (Call In 2302) Error:NFS3ERR_NOENT
   2304 2992.318770 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934
> nfs [ACK] Seq=238293 Ack=230289 Win=8192 Len=0 TSV=86492342 TSER=12393434
   2306 3011.537520 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
GETATTR Reply (Call In 2305)  Directory mode:2755 uid:4100 gid:4100
   2307 3011.637744 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934
> nfs [ACK] Seq=238429 Ack=230405 Win=8192 Len=0 TSV=86511662 TSER=12395366
   2308 3371.534980 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs
> 934 [FIN, ACK] Seq=230405 Ack=238429 Win=49232 Len=0 TSV=12431366
TSER=86511662

	The server decides, for whatever reason, to terminate the
	connection and sends a FIN.

   2309 3371.535018 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934
> nfs [ACK] Seq=238429 Ack=230406 Win=8192 Len=0 TSV=86871578 TSER=12431366

	Client acknowledges this,

   2310 3375.379693 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
ACCESS Call, FH:0x008002a2

	but tries to sneak in another call anyway.	[A]

   2311 3375.474788 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs
> 934 [ACK] Seq=230406 Ack=238569 Win=49232 Len=0 TSV=12431760 TSER=86875423

	Server ACKs but doesn't send anything else... [B]

	Time passes...

   2312 3675.366081 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934
> nfs [FIN, ACK] Seq=238569 Ack=230406 Win=8192 Len=0 TSV=87175425
TSER=12431760

	Client finally decides after 300 secs to close the connection
	too

   2313 3675.366149 xxx.xxx.31.43         xxx.xxx.16.142        TCP      904
> nfs [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=5 TSV=87175425 TSER=0

	and to re-open a new one.

   2314 3675.366318 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs
> 934 [ACK] Seq=230406 Ack=238570 Win=49232 Len=0 TSV=12461749 TSER=87175425
   2315 3675.366446 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs
> 904 [SYN, ACK] Seq=0 Ack=1 Win=49232 Len=0 TSV=12461749 TSER=87175425
MSS=1460 WS=0
   2316 3675.366483 xxx.xxx.31.43         xxx.xxx.16.142        TCP      904
> nfs [ACK] Seq=1 Ack=1 Win=66592 Len=0 TSV=87175425 TSER=12461749
   2317 3675.366506 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
ACCESS Call (Reply In 2319), FH:0x008002a2
   2318 3675.366660 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs
> 904 [ACK] Seq=1 Ack=141 Win=49092 Len=0 TSV=12461749 TSER=87175425
   2319 3675.367356 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
ACCESS Reply (Call In 2317)
   2320 3675.367425 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
GETATTR Call (Reply In 2322), FH:0x170cb16a
   2321 3675.367644 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs
> 904 [ACK] Seq=125 Ack=277 Win=49232 Len=0 TSV=12461749 TSER=87175426
   2322 3675.367730 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
GETATTR Reply (Call In 2320)  Directory mode:2755 uid:4100 gid:4100
   2323 3675.367759 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
ACCESS Call (Reply In 2325), FH:0x170cb16a

Point [A] seems somwehat worrisome to me:

Though technically the connection is closed in one direction
only, the intention of the server seems clear, and it would be better to
be careful and make a new connection right away.

[B] would be a bug of the server in my opinion. If it ACKs a call, it
should send a reply. And if it can't, it shouldn't.

Please Cc me on replies, I am not subscribed to this list.

-Olaf.
--

Claus Guttesen

2009-Oct-27 17:22 UTC

head link

8.0-RC1 NFS client timeout issue

> I see an annoying behaviour with NFS over TCP. It happens both with nfs
> and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
> some Linux or perhaps Solaris, I'm not entirely sure.
I used nfs with tcp on a 7.2-client without problems on a solaris
nfs-server. When I upgraded to RC1 I had 'server not responding -
alive again' messages so I swithced to udp which works flawlessly. I
haven't had time to investigate it though.
-- 
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

twitter.com/kometen

Rick Macklem

2009-Oct-28 20:31 UTC

head link

8.0-RC1 NFS client timeout issue

First off, I know that cross posting is evil, but I wanted to try
and make sure developers saw it.

On Tue, 27 Oct 2009, Olaf Seibert wrote:
> I see an annoying behaviour with NFS over TCP. It happens both with nfs
> and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
> some Linux or perhaps Solaris, I'm not entirely sure.
>
> After trying to find something in packet traces, I think I have found
> something.
>
> The scenario seems to be as follows. Sorry for the width of the lines.
>
>
> No.     Time        Source                Destination           Protocol
Info
>   2296 2992.216855 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
LOOKUP Call (Reply In 2297), DH:0x3819da36/w
>   2297 2992.217107 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
LOOKUP Reply (Call In 2296) Error:NFS3ERR_NOENT
>   2298 2992.217141 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
LOOKUP Call (Reply In 2299), DH:0x170cb16a/bin
>   2299 2992.217334 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
LOOKUP Reply (Call In 2298), FH:0x61b8eb12
>   2300 2992.217361 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
ACCESS Call (Reply In 2301), FH:0x61b8eb12
>   2301 2992.217582 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
ACCESS Reply (Call In 2300)
>   2302 2992.217605 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
LOOKUP Call (Reply In 2303), DH:0x61b8eb12/w
>   2303 2992.217860 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
LOOKUP Reply (Call In 2302) Error:NFS3ERR_NOENT
>   2304 2992.318770 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934
> nfs [ACK] Seq=238293 Ack=230289 Win=8192 Len=0 TSV=86492342 TSER=12393434
>   2306 3011.537520 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3
GETATTR Reply (Call In 2305)  Directory mode:2755 uid:4100 gid:4100
>   2307 3011.637744 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934
> nfs [ACK] Seq=238429 Ack=230405 Win=8192 Len=0 TSV=86511662 TSER=12395366
>   2308 3371.534980 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs
> 934 [FIN, ACK] Seq=230405 Ack=238429 Win=49232 Len=0 TSV=12431366
TSER=86511662
>
> 	The server decides, for whatever reason, to terminate the
> 	connection and sends a FIN.
>
>   2309 3371.535018 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934
> nfs [ACK] Seq=238429 Ack=230406 Win=8192 Len=0 TSV=86871578 TSER=12431366
>
> 	Client acknowledges this,
>
>   2310 3375.379693 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3
ACCESS Call, FH:0x008002a2
>
> 	but tries to sneak in another call anyway.	[A]
>Probably not the best behaviour, but I think it is technically allowed by 
TCP. (My TCP is very rusty, but I think the socket should be in
TCPS_CLOSE_WAIT at this point and the BSD code will have called
socantrcvmore(), but not socantsndmore().)
>   2311 3375.474788 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs
> 934 [ACK] Seq=230406 Ack=238569 Win=49232 Len=0 TSV=12431760 TSER=86875423
>
> 	Server ACKs but doesn't send anything else... [B]
>
> 	Time passes...
>This is where it seems interesting. It looks to me like the socket upcall
for receiving the FIN would have happened before this point, setting the
ct_error.re_status to RPC_CANTRECV, but the code in clnt_vc_call() doesn't
check for this. (It does check for it happening during and after the
sosend(), but not before it, from what I can see.)
>
> [B] would be a bug of the server in my opinion. If it ACKs a call, it
> should send a reply. And if it can't, it shouldn't.
>I'll leave this one for the TCP wizzards. I'm not sure what the
correct behaviour is when data is received on a connection. (I think
it is waiting for a FIN from the client side at this point.)

If you could try the following patch and see if it helps, that would be
appreciated, rick
ps: I'll try to reproduce the situation here, but I'm not sure if I can.
--- rpc/clnt_vc.c.sav	2009-10-28 15:44:20.000000000 -0400
+++ rpc/clnt_vc.c	2009-10-28 15:49:57.000000000 -0400
@@ -413,6 +413,19 @@

  	cr->cr_xid = xid;
  	mtx_lock(&ct->ct_lock);
+	/*
+	 * Check to see if the other end has already started to close down
+	 * the connection. If it happens after this point, it will be
+	 * detected below, when cr->cr_error is checked.
+	 */
+	if (ct->ct_error.re_status == RPC_CANTRECV) {
+		if (errp != &ct->ct_error) {
+			errp->re_errno = ct->ct_error.re_errno;
+			errp->re_status = RPC_CANTRECV;
+		}
+		stat = RPC_CANTRECV;
+		goto out;
+	}
  	TAILQ_INSERT_TAIL(&ct->ct_pending, cr, cr_link);
  	mtx_unlock(&ct->ct_lock);

Rick Macklem

2009-Nov-23 21:37 UTC

head link

8.0-RC1 NFS client timeout issue

On Tue, 27 Oct 2009, Olaf Seibert wrote:
> I see an annoying behaviour with NFS over TCP. It happens both with nfs
> and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
> some Linux or perhaps Solaris, I'm not entirely sure.
>
[good stuff snipped...]>
> Though technically the connection is closed in one direction
> only, the intention of the server seems clear, and it would be better to
> be careful and make a new connection right away.
>I believe that r199293 committed to stable/8 should fix this. It did not
make it into FreeBSD8.0, so users of FreeBSD8.0 will need to switch to
using "udp" or apply the patch themselves, if slow reconnects after a
non-FreeBSD NFS server are causing them grief.

rick

freebsd stable - Oct 2009 - 8.0-RC1 NFS client timeout issue

8.0-RC1 NFS client timeout issue

8.0-RC1 NFS client timeout issue

8.0-RC1 NFS client timeout issue

8.0-RC1 NFS client timeout issue