Erick Perez - Quadrian Enterprises
2019-Aug-30 16:27 UTC
[CentOS] NFS Server on Centos 7.6.1810 dying without reason
Good morning,

In order to post proper documentation, what logs (or log level) do I need to troubleshoot a CentOS 7.6.1810 (3.10.0-957.27.2.el7.x86_64) box that is running an NFS server on top of LVM on top of XFS on top of VDO on top of mdadm on a 6-SSD RAID6?

This physical NFS server is serving 2 NFS v4.2 shares to 2 physical KVM virtualization hosts.

When remote NFS clients start doing intensive work (massive writes), the NFS server crashes and sometimes makes the Linux server reboot.

From the KVM side this is all I get:
hvm002 kernel: nfs: server 10.10.10.2 not responding, timed out

thanks

---------------------
Erick
---------------------
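
One way to capture more detail before the next crash is to raise the NFS/RPC debug level and make sure kdump can save a vmcore if the kernel panics. A minimal sketch, assuming the stock nfs-utils and kexec-tools packages (commands to adapt, not something already run on this server):

# Raise nfsd and RPC debug verbosity (messages go to the kernel log / journal)
rpcdebug -m nfsd -s all
rpcdebug -m rpc -s all

# Make sure kdump can write a vmcore to /var/crash if the kernel panics
yum install -y kexec-tools
systemctl enable --now kdump

# Follow kernel messages while reproducing the load
journalctl -k -f

# Turn the extra debugging off again afterwards
rpcdebug -m nfsd -c all
rpcdebug -m rpc -c all

If the machine resets without ever reaching a panic, no vmcore may be written at all, which would itself suggest a hardware- or power-level reset rather than a kernel crash.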
Erick Perez - Quadrian Enterprises
2019-Aug-30 21:25 UTC
[CentOS] NFS Server on Centos 7.6.1810 dying without reason
I managed to get NFS statistics just before the server rebooted. I can reproduce this every time.

Question: Am I saturating the SATA bus on the NFS server so violently with the SSDs that the server hard reboots? Perhaps due to the RAID 6 nature of operations?

Server info:
AMD EPYC 7281 16-core
32 GB RAM
6x SATA3 SSD 1.9TB in mdadm RAID6
OS: CentOS 7.6
SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

NFS server parameters: /etc/sysconfig/nfs
RPCNFSDCOUNT=32
All other parameters are as default in a CentOS 7.6 install.

From the NFS client, the share is mounted as:
10.10.10.2:/mnt/lv_vdo0 on /rhev/data-center/mnt/10.10.10.2:_mnt_lv__vdo0 type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,clientaddr=10.10.10.5,local_lock=none,addr=10.10.10.2)

1- My test:
a- mount NFS share A1 from the NFS server on the NFS client
b- mount NFS share B1 from the SAME NFS server on the SAME NFS client
c- copy file1.iso from /mnt/A1/file1.iso to /mnt/B1/file1.iso

Notes: if the file is small (tried with 1 GB ISO images), no problem happens. If the file is bigger (tried with a 3 GB ISO), the NFS server hard reboots. Before the server hard reboots, iostat shows 16 nfsd worker threads and 211.03 M/s reads and 223.68 M/s writes sustained (monitoring commands are sketched after the log excerpt below).

2- From the NFS server, these are the last logs. The entry at 14:26:52 is the last line before the NFS server suddenly hard reboots.

Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 1
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 1641 slot_seqid 1640
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080 opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080 opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot ffff9c8bd7bbd000
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot ffff9c8bd7bba000
Aug 30 14:16:12 hvmnas01 kernel: renewing client (clientid 5d694c2d/12bb1419)
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl: 1567181869:314250265:6:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 0
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 12545 slot_seqid 12544
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl: 1567181869:314250265:6:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 1
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 1642 slot_seqid 1641
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080 opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080 opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8b6a7be080 opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot ffff9c8bd7bba000
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot ffff9c8bd7bbd000
Aug 30 14:16:12 hvmnas01 kernel: renewing client (clientid 5d694c2d/12bb1419)
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/4: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl: 1567181869:314250264:5:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 0
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 8036 slot_seqid 8035
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 4 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/4: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(32: 81060001 87b2cdfe a744333d d3d8b5a0 d514631c 04000440)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 4 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/4: 3 (OP_ACCESS)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(32: 81060001 87b2cdfe a744333d d3d8b5a0 d514631c 04000440)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 4 #3: 3: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #4/4: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(32: 81060001 87b2cdfe a744333d d3d8b5a0 d514631c 04000440)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 4 #4: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot ffff9c8b38ab0000
Aug 30 14:16:12 hvmnas01 kernel: renewing client (clientid 5d694c2d/12bb1418)
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl: 1567181869:314250265:6:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 0
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 12546 slot_seqid 12545
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #3: 9: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound returned 0
Aug 30 14:16:12 hvmnas01 kernel: --> nfsd4_store_cache_entry slot ffff9c8bd7bbd000
Aug 30 14:16:12 hvmnas01 kernel: renewing client (clientid 5d694c2d/12bb1419)
Aug 30 14:16:12 hvmnas01 kernel: nfsd_dispatch: vers 4 proc 1
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #1/3: 53 (OP_SEQUENCE)
Aug 30 14:16:12 hvmnas01 kernel: __find_in_sessionid_hashtbl: 1567181869:314250265:6:0
Aug 30 14:16:12 hvmnas01 kernel: nfsd4_sequence: slotid 0
Aug 30 14:16:12 hvmnas01 kernel: check_slot_seqid enter. seqid 12547 slot_seqid 12546
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #1: 53: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #2/3: 22 (OP_PUTFH)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op ffff9c8bfcaa8080 opcnt 3 #2: 22: status 0
Aug 30 14:16:12 hvmnas01 kernel: nfsv4 compound op #3/3: 9 (OP_GETATTR)
Aug 30 14:16:12 hvmnas01 kernel: nfsd: fh_verify(20: 00060001 9799be55 5a4051e2 b5f423b9 801ff55f 00000000)
Aug 30 14:26:52 hvmnas01 kernel: do_IRQ: 1.35 No irq handler for vector (irq -1)
Aug 30 14:26:52 hvmnas01 rsyslogd: imjournal: 233164 messages lost due to rate-limiting
Aug 30 14:30:01 hvmnas01 systemd: Started Session 38 of user root.
Aug 30 14:38:29 hvmnas01 journal: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.5G available → current limit 1.5G).
Aug 30 14:38:29 hvmnas01 kernel: Initializing cgroup subsys cpuset
Aug 30 14:38:29 hvmnas01 kernel: Initializing cgroup subsys cpu
Aug 30 14:38:29 hvmnas01 kernel: Initializing cgroup subsys cpuacct
Aug 30 14:38:29 hvmnas01 kernel: Linux version 3.10.0-957.27.2.el7.x86_64 (mockbuild at kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Mon Jul 29 17:46:05 UTC 2019
Aug 30 14:38:29 hvmnas01 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-957.27.2.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet LANG=en_US.UTF-8 skew_tick=1
Aug 30 14:38:29 hvmnas01 kernel: e820: BIOS-provided physical RAM map:

---------------------
Erick Perez
---------------------

On Fri, Aug 30, 2019 at 11:27 AM Erick Perez - Quadrian Enterprises <eperez at quadrianweb.com> wrote:
>
> Good morning,
> In order to post proper documentation, what logs (or log level) do I
> need to troubleshoot a CentOS 7.6.1810 3.10.0-957.27.2.el7.x86_64 that
> is running an NFS server on top of LVM on top of XFS on top of VDO on top
> of mdadm on a 6 SSD disk RAID6?
>
> This physical NFS server is serving 2 NFS v4.2 shares to 2 physical
> KVM virtualization hosts.
>
> When remote NFS clients start doing intensive stuff (massive writes),
> the NFS server crashes and sometimes makes the Linux server reboot.
>
> From the KVM side this is all I get: hvm002 kernel: nfs: server 10.10.10.2
> not responding, timed out
>
> thanks
>
> ---------------------
> Erick
> ---------------------
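
For the throughput and nfsd-thread figures quoted above, roughly equivalent numbers can be gathered with standard tools while the copy test runs; a rough sketch, assuming the sysstat package provides iostat:

# Per-device throughput, queue depth and %util on the RAID member disks while the copy runs
iostat -xm 2

# nfsd thread count and how often all threads were busy (the "th" line)
grep ^th /proc/net/rpc/nfsd

# Server-side NFS operation counters
nfsstat -s

# mdadm array health (a degraded member or a running check/resync would matter here)
cat /proc/mdstat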
On Fri, 30 Aug 2019 16:25:40 -0500 Erick Perez - Quadrian Enterprises <eperez at quadrianweb.com> wrote:
> I managed to get NFS statistics just before server rebooted.
> I can reproduce this every time.
> Question: Am I saturating the SATA bus on the NFS server so violently
> with the SSDs that the server hard reboots?
> Perhaps due to the RAID 6 nature of operations?
>
> Server info
> AMD EPYC 7281 16CORE
> 32 GIG RAM
> 6xSATA3 SSD 1.9TB in mdadm RAID6
> OS: Centos 7.6
> SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA
> Controller [AHCI mode] (rev 51)
> [...]

First I'd replace the network card, preferably with something like an Intel e1000. Perhaps use a dedicated network card for the server itself and another card for the KVM clients.
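
Before swapping the card, the current NIC can also be checked for errors from the command line; a quick sketch, where eth0 stands in for whatever interface actually carries the NFS traffic:

# Driver, firmware and link state of the NFS-facing interface
ethtool -i eth0
ethtool eth0

# Error, drop and FIFO counters exposed by the driver and the kernel
ethtool -S eth0 | egrep -i 'err|drop|fifo'
ip -s link show eth0

# Current and maximum ring buffer sizes
ethtool -g eth0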