On Mon, Oct 24, 2016 at 7:51 AM, mark <m.roth at 5-cent.us> wrote:
> On 10/24/16 03:52, Larry Martell wrote:
>>
>> On Fri, Oct 21, 2016 at 11:42 AM,  <m.roth at 5-cent.us> wrote:
>>>
>>> Larry Martell wrote:
>>>>
>>>> On Fri, Oct 21, 2016 at 11:21 AM,  <m.roth at 5-cent.us> wrote:
>>>>>
>>>>> Larry Martell wrote:
>>>>>>
>>>>>> We have 1 system running CentOS 7 that is the NFS server. There are 50
>>>>>> external machines that FTP files to this server fairly continuously.
>>>>>>
>>>>>> We have another system running CentOS 6 that mounts the partition the
>>>>>> files are FTP-ed to using NFS.
>
> <snip>
>>>>>
>>>>> What filesystem?
>
> <snip>
>>>
>>> cat /etc/fstab on the systems, and see what they are. If either is xfs,
>>> and assuming that the systems are on UPSes, then the fstab which controls
>>> drive mounting on a system should have, instead of "defaults",
>>> nobarrier,inode64.
>>
>>
>> The server is xfs (the client is nfs). The server does have inode64
>> specified, but not nobarrier.
>>
>>> Note that the inode64 is relevant if the filesystem is > 2TB.
>>
>>
>> The file system is 51TB.
>>
>>> The reason I say this is that when we started rolling out CentOS 7, we
>>> tried to put one of our users' home directories on one, and it was a
>>> disaster. 100% repeatedly, untarring a 100M tarfile onto an nfs-mounted
>>> drive took seven minutes, where before, it had taken 30 seconds. Timed.
>>> It took us months to discover that NFS 4 tries to make transactions
>>> atomic, which is fine if you're worrying about losing power or
>>> connectivity. If you're on a UPS, and hardwired, adding the nobarrier
>>> immediately brought it down to 40 seconds or so.
>>
>>
>> We are not seeing a performance issue - do you think nobarrier would
>> help with our lock up issue? I wanted to try it but my client did not
>> want me to make any changes until we got the bad disk replaced.
>> Unfortunately that will not happen until Wednesday.
>
>
> Absolutely add nobarrier, and see what happens.

Finally got to add nobarrier (I'll skip why it took so long), and it
looks like this just caused the problem to morph a bit.
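
One way to confirm that the option is really active after the remount is
to look at the options column of /proc/mounts on the server. The following
is only a minimal sketch: the device name and the /data mount point in the
sample line are hypothetical, and it assumes the kernel reports the option
as "nobarrier" in that file.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of mount options for mount_point, or None if absent.

    mounts_text is text in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # If a mount point is listed more than once, the last entry wins.
            found = set(fields[3].split(","))
    return found


# Hypothetical sample line; on a real system read /proc/mounts instead.
SAMPLE = "/dev/sdb1 /data xfs rw,relatime,attr2,inode64,nobarrier,noquota 0 0\n"

opts = mount_options(SAMPLE, "/data")
for wanted in ("nobarrier", "inode64"):
    print(wanted, "present" if wanted in opts else "MISSING")
```

On the server itself the same function could be fed
open("/proc/mounts").read() in place of SAMPLE.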

On the C7 NFS server, besides having 50 external machines ftp-ing
files to it, we run 2 jobs: 1 that moves files around (called
image_mover) and one that changes perms on some files (called
chmod_job).

And on the C6 NFS client, besides the job that was hanging (called the
importer), we also run another job (called ftp_job) that ftps files
to the C6 machine. The ftp_job had never hung before, but now the
importer that used to hang has not (yet) hung, and the ftp_job that
had not hung before now is hanging.

But the system messages are different.

On the C7 server there is a series of messages of the form 'task
blocked for >120 seconds' with a stack trace. There is one for each of
the following:

nfsd, chmod_job, kworker, pure_ftpd, image_mover

In each of the stack traces they are blocked on either nfs_write or
nfs_flush.

And on the C6 client there is a similar blocked message for the ftp
job, blocked on nfs_flush, then the bad sequence number message I had
seen before, and at that point the ftp_job hung.
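
To compare the two machines it may help to tally which tasks the kernel
reported as blocked. Below is a rough sketch, assuming the messages follow
the usual hung-task format ("INFO: task NAME:PID blocked for more than N
seconds."). The sample log lines and PIDs are made up for illustration and
are not taken from our logs.

```python
import re
from collections import Counter

# Matches the kernel hung-task warning. The task name may itself contain
# a colon (e.g. kworker/3:1), so the PID is anchored to the text that
# follows it rather than to the first colon.
HUNG = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def tally_hung_tasks(log_text):
    """Count hung-task warnings per task name in dmesg or syslog text."""
    counts = Counter()
    for match in HUNG.finditer(log_text):
        counts[match.group("name")] += 1
    return counts


# Made-up sample input; real input would come from dmesg or /var/log/messages.
SAMPLE_LOG = """\
kernel: INFO: task nfsd:1201 blocked for more than 120 seconds.
kernel: INFO: task kworker/3:1:877 blocked for more than 120 seconds.
kernel: INFO: task nfsd:1202 blocked for more than 120 seconds.
kernel: INFO: task image_mover:4410 blocked for more than 120 seconds.
"""

for name, count in sorted(tally_hung_tasks(SAMPLE_LOG).items()):
    print(name, count)
```

Running this against the logs from both the C7 server and the C6 client
would show at a glance whether the same tasks keep getting stuck.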