On 13/05/2020 at 02:13, Orion Poplawski wrote:
> On 5/12/20 2:46 AM, Patrick Bégou wrote:
>> Hi,
>>
>> I need some help with NFSv4 setup/tuning. I have a dedicated NFS server
>> (2 x E5-2620, 8 cores/16 threads each, 64 GB RAM, 1 x 10Gb Ethernet and
>> 16 x 8 TB HDD) used by two servers and a small cluster (400 cores). All
>> the servers are running CentOS 7; the cluster is running CentOS 6.
>>
>> From time to time, on the server I get:
>>
>>     kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with
>>     incorrect client ID
>>
>> And the client xxx.xxx.xxx.xxx freezes with:
>>
>>     kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding,
>>     still trying
>>     kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
>>     kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding,
>>     still trying
>>     kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
>>
>> There is a discussion on Red Hat 7 support about this, but it is only
>> open to subscribers. Other searches with Google do not provide useful
>> information.
>
> FYI - you can get access to such info with a free RHEL developers
> account.

Thanks for your suggestion. As the problem is back, I've subscribed to
get the full content of this discussion.

The answer was "do not use antivirus" :-(. I do not use any antivirus,
as I run CentOS only.

Patrick

On 6/1/20 3:08 AM, Patrick Bégou wrote:
> Thanks for your suggestion. As the problem is back, I've subscribed to
> get the full content of this discussion.
>
> The answer was "do not use antivirus" :-(. I do not use any antivirus,
> as I run CentOS only.
>
> Patrick

Just curious to see if you have had any luck resolving these issues?
I'm afraid that NFS on EL 7 has become much less stable for us recently
as well, with lots more client access hangs.

Orion

-- 
Orion Poplawski
Manager of NWRA Technical Systems          720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion at nwra.com
Boulder, CO 80301                 https://www.nwra.com/

On 2020-06-01 11:08, Patrick Bégou wrote:
> >> I need some help with NFSv4 setup/tuning. I have a dedicated nfs server
> >> (2 x E5-2620, 8 cores/16 threads each, 64GB RAM, 1x10Gb ethernet and 16x
> >> 8TB HDD) used by two servers and a small cluster (400 cores). All the
> >> servers are running CentOS 7, the cluster is running CentOS6.
> >>
> >> Time to time on the server I get:
> >>
> >>     kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with
> >>     incorrect client ID

According to Red Hat Bugzilla [1], 2015-11-19:

    "testing state ID with incorrect client ID" means the server
    thinks a TEST_STATEID op was sent for a stateid associated with
    a client different from the client associated with the session
    over which the TEST_STATEID was sent. Perhaps this could be the
    result of some confusion in the server's data structures but the
    most straightforward explanation would be just that that's really
    what the client did (perhaps as a result of a bug in client
    recovery code?)

The above explanation is applicable, but unless you're running a rather
old kernel, that /particular/ bug is not.

My understanding of your issue from the thread to date is that you've
not yet narrowed the issue to the NFS server, the network, the 2 server
clients, or the cluster clients. In other words, the corrupt client ID
could be tendered by either of the 2 servers, or by the cluster clients,
or could be corrupted in transit over the network, or could originate on
the NFS server. Correct?

According to my notes from a class given by Ted Ts'o on NFS, always
start with checking network health.
That should be easy, using interface statistics, e.g.:

    $ ifconfig eth3
    eth3      Link encap:Ethernet  HWaddr A0:36:9F:10:A9:06
              inet addr:10.10.1.100  Bcast:10.10.1.255  Mask:255.255.255.0
              inet6 addr: fe80::a236:9fff:fe10:a906/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
              RX packets:858295756 errors:0 dropped:0 overruns:0 frame:0
                                   ^^^^^^^^
              TX packets:7090386023 errors:0 dropped:0 overruns:0 carrier:0
                                    ^^^^^^^^
              collisions:0 txqueuelen:1000
              RX bytes:495026510281 (461.0 GiB)  TX bytes:10475167734024 (9.5 TiB)

    $ ip --stats link show eth3
    eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
        link/ether a0:36:9f:10:a9:06 brd ff:ff:ff:ff:ff:ff
        RX: bytes       packets     errors  dropped  overrun  mcast
        495027287320    858296399   0       0        0        112282
                                    ^^^^^^
        TX: bytes       packets     errors  dropped  carrier  collsns
        10475167775376  7090386249  0       0        0        0
                                    ^^^^^^

Layer 2 stats can also be checked using ethtool:

    $ sudo ethtool --statistics eth3 | egrep 'dropped|errors'
         rx_errors: 0
         tx_errors: 0
         ...

If you've got a clean, healthy network, that leaves the clients or the
server. Maybe the clients are asking for the wrong ID. To analyze the
client ID given, you could capture traffic at the server using, perhaps:

    # tcpdump -W 10 -C 10 -w nfs_capture host <client-ipaddr>

Then, using tshark or wireshark, see if the client is sending consistent
client IDs. If so, that would exonerate the clients, leaving as suspect
the NFS daemon code in the Linux kernel.

Another point that Mr. Ts'o made (which it sounds like you have covered)
is: don't combine an NFS server with another application or service. I
mention this only because I'm pedantic and obsessive, or maybe
obsessively pedantic.

Also worth mentioning: consider specifying no_subtree_check in your NFS
exports. And Ts'o suggested (ca. 2012) using fs_mark (available from the
EPEL repository) to exercise your file systems.

Best luck,
-- 
Charles Polisher

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1233284
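
The interface check described above can be scripted so that it is easy
to repeat on the server and on each client. What follows is only a
minimal sketch, assuming a Linux kernel that exposes per-interface
counters under /sys/class/net/<iface>/statistics. The function name
check_iface_errors and its optional second argument are illustrative
and not part of any existing tool.

```shell
# Report non-zero error/drop counters for one network interface.
# Usage: check_iface_errors IFACE [NET_DIR]
# NET_DIR defaults to /sys/class/net; the optional argument is a
# hypothetical convenience so the function can be pointed at a copy
# of that tree.
check_iface_errors() {
    iface=$1
    stats_dir=${2:-/sys/class/net}/$iface/statistics
    if [ ! -d "$stats_dir" ]; then
        echo "no statistics directory for $iface" >&2
        return 2
    fi
    bad=0
    for counter in rx_errors tx_errors rx_dropped tx_dropped; do
        # A missing or empty counter file is treated as zero.
        value=$(cat "$stats_dir/$counter" 2>/dev/null) || value=0
        value=${value:-0}
        if [ "$value" -ne 0 ]; then
            echo "$iface $counter=$value"
            bad=1
        fi
    done
    return "$bad"
}
```

Called as `check_iface_errors eth3`, it prints nothing and returns 0
when all four counters are zero; otherwise it prints one line per
non-zero counter and returns 1.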

Hi Orion,

No, I still have this problem. I have delayed working on it because the
latest updates have not yet been installed on the server and on the
client. I'll get back to this problem as soon as possible.

Thanks, Charles, for your detailed information on how to track this
problem. I'll check all these metrics.

I have several clients for this NFS server, and the problem seems to
occur only from the client using NFS 4.1 on CentOS Linux release
7.7.1908 (Core). The default options used are:

    rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=194.254.xx.xx,local_lock=none,addr=194.254.yy.yy

On older clients (Red Hat Enterprise Linux Server release 6.7
(Santiago)) the default options are:

    rw,intr,hard,sloppy,vers=4,addr=194.254.xx.xx,clientaddr=194.254.yy.yy

The server runs CentOS 7.6.1810.

I will see whether the latest updates help to solve the problem.

Patrick

On 03/07/2020 at 00:05, Orion Poplawski wrote:
> Just curious to see if you have had any luck resolving these issues?
> I'm afraid that NFS on EL 7 has become much less stable for us
> recently as well with lots more client access hangs.
>
> Orion
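
Since the hangs seem to follow the NFS 4.1 client, it may help to
confirm which protocol version each client actually negotiated.
TEST_STATEID, the operation named in the server log message, is an
NFSv4.1 operation, which would be consistent with that observation.
The following is a minimal sketch, assuming input in /proc/mounts
format. The function name nfs_mount_versions is illustrative only.

```shell
# List NFS mount points with the value of their vers= option.
# Reads lines in /proc/mounts format on standard input:
#   device mountpoint fstype options dump pass
nfs_mount_versions() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        vers = "unknown"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^vers=/)
                vers = substr(opts[i], 6)   # text after "vers="
        print $2, vers
    }'
}
```

It can be run on each client as `nfs_mount_versions < /proc/mounts`;
non-NFS file systems are skipped.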