On Fri, Oct 21, 2016 at 11:42 AM, <m.roth at 5-cent.us> wrote:
> Larry Martell wrote:
>> On Fri, Oct 21, 2016 at 11:21 AM, <m.roth at 5-cent.us> wrote:
>>> Larry Martell wrote:
>>>> We have 1 system running CentOS 7 that is the NFS server. There are 50
>>>> external machines that FTP files to this server fairly continuously.
>>>>
>>>> We have another system running CentOS 6 that mounts the partition the
>>>> files are FTP-ed to using NFS.
>>> <snip>
>>> What filesystem?
>>
>> Sorry for being dense, but I am not a sysadmin, I am a programmer and
>> we have no sysadmin. I don't know what you mean by your question. I am
>> NFS-mounting to whatever the default filesystem would be on a CentOS 6
>> system.
>
> This *is* a sysadmin issue. Each partition is formatted as a specific
> type of filesystem. The standard Linux filesystems for
> upstream-descended distributions have been ext3, then ext4, and now xfs.
> Tools that manipulate xfs will not work with ext2/3/4, and vice versa.
>
> cat /etc/fstab on the systems, and see what they are. If either is xfs,
> and assuming that the systems are on UPSes, then the fstab entry that
> controls drive mounting should have, instead of "defaults",
> nobarrier,inode64.

The server is xfs (the client is nfs). The server does have inode64
specified, but not nobarrier.

> Note that inode64 is relevant if the filesystem is > 2TB.

The file system is 51TB.

> The reason I say this is that when we started rolling out CentOS 7, we
> tried to put one of our users' home directories on one, and it was a
> disaster. 100% repeatably, untarring a 100M tarfile onto an NFS-mounted
> drive took seven minutes, where before it had taken 30 seconds. Timed.
> It took us months to discover that NFS 4 tries to make transactions
> atomic, which is fine if you're worrying about losing power or
> connectivity. If you're on a UPS, and hardwired, adding nobarrier
> immediately brought it down to 40 seconds or so.

We are not seeing a performance issue - do you think nobarrier would help
with our lock-up issue? I wanted to try it but my client did not want me
to make any changes until we got the bad disk replaced. Unfortunately
that will not happen until Wednesday.
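[Editor's note: a quick way to check the filesystem type and options that mark is asking about; /export here is a placeholder mount point, not the actual path from this thread.]

```shell
# /export is a hypothetical mount point -- substitute the real one.

# Filesystem type backing the export (should print e.g. "xfs"):
findmnt -no FSTYPE /export

# Currently active mount options, one per line, so inode64/nobarrier
# are easy to spot:
findmnt -no OPTIONS /export | tr ',' '\n'

# Or read the configured options straight out of fstab:
grep -w /export /etc/fstab

# An xfs entry carrying the suggested options would look like:
# UUID=...  /export  xfs  inode64,nobarrier  0 0
```

`findmnt` shows the options the kernel is actually using, which can differ from what fstab says if the filesystem was mounted by hand.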
On 10/24/16 03:52, Larry Martell wrote:
> On Fri, Oct 21, 2016 at 11:42 AM, <m.roth at 5-cent.us> wrote:
>> Larry Martell wrote:
<snip>
>> cat /etc/fstab on the systems, and see what they are. If either is xfs,
>> and assuming that the systems are on UPSes, then the fstab entry that
>> controls drive mounting should have, instead of "defaults",
>> nobarrier,inode64.
>
> The server is xfs (the client is nfs). The server does have inode64
> specified, but not nobarrier.
>
>> Note that inode64 is relevant if the filesystem is > 2TB.
>
> The file system is 51TB.
<snip>
> We are not seeing a performance issue - do you think nobarrier would
> help with our lock-up issue? I wanted to try it but my client did not
> want me to make any changes until we got the bad disk replaced.
> Unfortunately that will not happen until Wednesday.

Absolutely add nobarrier, and see what happens.

mark
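[Editor's note: a sketch of how the option can be applied, again using /export as a placeholder. The XFS nobarrier option exists on CentOS 7-era kernels but, as far as I know, was deprecated and later removed in newer mainline kernels, so check mount(8)/xfs(5) on your system first.]

```shell
# Apply nobarrier to a live XFS mount without a reboot:
mount -o remount,nobarrier /export

# Verify it took effect:
findmnt -no OPTIONS /export | tr ',' '\n' | grep nobarrier

# Make it persistent by extending the options field in /etc/fstab,
# e.g. turning "inode64" into "inode64,nobarrier" (back up first):
cp /etc/fstab /etc/fstab.bak
sed -i 's#\([[:space:]]xfs[[:space:]]\+inode64\)\([[:space:]]\)#\1,nobarrier\2#' /etc/fstab
```

The remount takes effect immediately; the fstab edit only matters at the next boot, so doing both keeps them in sync.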
On 10/24/2016 04:51 AM, mark wrote:
> Absolutely add nobarrier, and see what happens.

Using "nobarrier" might increase overall write throughput, but it removes
an important integrity feature, increasing the risk of filesystem
corruption on power loss. I wouldn't recommend doing that unless your
system is on a UPS, and you've tested and verified that it will perform
an orderly shutdown when the UPS is on battery power and its charge is
low.
On Mon, Oct 24, 2016 at 7:51 AM, mark <m.roth at 5-cent.us> wrote:
> On 10/24/16 03:52, Larry Martell wrote:
>> On Fri, Oct 21, 2016 at 11:42 AM, <m.roth at 5-cent.us> wrote:
<snip>
>> We are not seeing a performance issue - do you think nobarrier would
>> help with our lock-up issue? I wanted to try it but my client did not
>> want me to make any changes until we got the bad disk replaced.
>> Unfortunately that will not happen until Wednesday.
>
> Absolutely add nobarrier, and see what happens.

Finally got to add nobarrier (I'll skip why it took so long), and it
looks like this just caused the problem to morph a bit.

On the C7 NFS server, besides having 50 external machines FTP-ing files
to it, we run 2 jobs: one that moves files around (called image_mover)
and one that changes perms on some files (called chmod_job). And on the
C6 NFS client, besides the job that was hanging (called the importer), we
also run another job (called ftp_job) that FTPs files to the C6 machine.

The ftp_job had never hung before, but now the importer that used to hang
has not (yet) hung, and the ftp_job that had not hung before now is
hanging. But the system messages are different.

On the C7 server there is a series of messages of the form 'task blocked
for > 120 seconds' with a stack trace. There is one for each of the
following: nfsd, chmod_job, kworker, pure_ftpd, image_mover. In each of
the stack traces they are blocked on either nfs_write or nfs_flush.

And on the C6 client there is a similar blocked message for the ftp_job,
blocked on nfs_flush, then the bad sequence number message I had seen
before, and at that point the ftp_job hung.
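[Editor's note: for anyone chasing the same symptoms, these are the generic places to pull the blocked-task traces and NFS counters from; nothing here is specific to this poster's setup.]

```shell
# The 'task ... blocked for more than 120 seconds' traces live in the
# kernel ring buffer; grab each one with some trailing context:
dmesg | grep -A 15 'blocked for more than 120 seconds'

# The 120-second threshold is just a sysctl; raising it silences the
# warnings but does not unstick the tasks:
cat /proc/sys/kernel/hung_task_timeout_secs

# List processes currently in uninterruptible sleep (state D), with the
# kernel function they are waiting in:
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# NFS RPC counters; heavy retransmissions on the client side suggest
# the server is the one stalling:
nfsstat -c      # on the C6 client
nfsstat -s      # on the C7 server
```

Comparing the wchan column across the stuck processes is a cheap way to confirm they are all blocked in the same NFS/XFS code path.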