Matt Garman
2015-Apr-29 13:35 UTC
[CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?
We have a "compute cluster" of about 100 machines that do a read-only NFS mount to a big NAS filer (a NetApp FAS6280). The jobs running on these boxes are analysis/simulation jobs that constantly read data off the NAS.

We recently upgraded all these machines from CentOS 5.7 to CentOS 6.5. We did a "piecemeal" upgrade, usually upgrading five or so machines at a time, every few days. We noticed improved performance on the CentOS 6 boxes. But as the number of CentOS 6 boxes increased, we actually saw performance on the CentOS 5 boxes decrease. By the time we had only a few CentOS 5 boxes left, they were performing so badly as to be effectively worthless.

What we observed in parallel to this upgrade process was that the read latency on our NetApp device skyrocketed. This in turn caused all compute jobs to actually run slower, as it seemed to move the bottleneck from the client servers' OS to the NetApp. This is somewhat counter-intuitive: CentOS 6 performs faster, but actually results in net performance loss because it creates a bottleneck on our centralized storage.

All indications are that CentOS 6 seems to be much more "aggressive" in how it does NFS reads. And likewise, CentOS 5 was very "polite", to the point that it basically got starved out by the introduction of the 6.5 boxes.

What I'm looking for is a "deep dive" list of changes to the NFS implementation between CentOS 5 and CentOS 6. Or maybe this is due to a change in the TCP stack? Or maybe the scheduler? We've tried a lot of sysctl tcp tunings, various nfs mount options, anything that's obviously different between 5 and 6... But so far we've been unable to find the "smoking gun" that causes the obvious behavior change between the two OS versions.

Just hoping that maybe someone else out there has seen something like this, or can point me to some detailed documentation that might clue me in on what to look for next. Thanks!
m.roth at 5-cent.us
2015-Apr-29 15:00 UTC
[CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?
Matt Garman wrote:
> We have a "compute cluster" of about 100 machines that do a read-only
> NFS mount to a big NAS filer (a NetApp FAS6280). The jobs running on
> these boxes are analysis/simulation jobs that constantly read data off
> the NAS.
>
> We recently upgraded all these machines from CentOS 5.7 to CentOS 6.5.
> We did a "piecemeal" upgrade, usually upgrading five or so machines at
> a time, every few days. We noticed improved performance on the CentOS
> 6 boxes. But as the number of CentOS 6 boxes increased, we actually
> saw performance on the CentOS 5 boxes decrease. By the time we had
> only a few CentOS 5 boxes left, they were performing so badly as to be
> effectively worthless.
>
> What we observed in parallel to this upgrade process was that the read
> latency on our NetApp device skyrocketed. This in turn caused all
> compute jobs to actually run slower, as it seemed to move the
> bottleneck from the client servers' OS to the NetApp. This is
> somewhat counter-intuitive: CentOS 6 performs faster, but actually
> results in net performance loss because it creates a bottleneck on our
> centralized storage.
<snip>

*IF* I understand you, I've got one question: what parms are you using to mount the storage? We had *real* performance problems when we went from 5 to 6 - as in, unzipping a 26M file to 107M, while writing to an NFS-mounted drive, went from 30 sec or so to a *timed* 7 min. The final answer was that once we mounted the NFS filesystem with nobarrier in fstab instead of default, the time dropped to 35 or 40 sec again.

barrier is in 6, and tries to make writes atomic transactions; its intent is to protect in case of things like power failure. Esp. if you're on UPSes, nobarrier is the way to go.

mark
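For reference, the change mark describes amounts to a one-option edit in /etc/fstab. A hypothetical sketch (device, mount point, and export path are made up; note that barrier/nobarrier is an ext4 mount option, so it applies to a filesystem you control locally, not to the NFS client side of a NetApp mount):

```
# Hypothetical local ext4 data volume: write barriers disabled
/dev/sdb1   /export/data   ext4   defaults,nobarrier   0  2
```

As mark notes, this trades crash consistency for write latency, so it is only reasonable on UPS- or battery-backed storage.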
Devin Reade
2015-Apr-29 15:36 UTC
[CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?
--On Wednesday, April 29, 2015 08:35:29 AM -0500 Matt Garman <matthew.garman at gmail.com> wrote:

> All indications are that CentOS 6 seems to be much more "aggressive"
> in how it does NFS reads. And likewise, CentOS 5 was very "polite",
> to the point that it basically got starved out by the introduction of
> the 6.5 boxes.

Some things come to mind as far as investigating differences; you don't have to answer them all here; just making sure you've covered them all:

Have you looked at the client-side NFS cache? Perhaps the C6 cache is either disabled, has fewer resources, or is invalidating faster? (I don't think that would explain the C5 starvation, though, unless it's a secondary effect from retransmits, etc.)

Regarding the cache, do you have multiple mount points on a client that resolve to the same server filesystem? If so, do they have different mount options? If so, that can result in multiple caches instead of a single disk cache. The client cache can also be bypassed if your application is doing direct I/O on the files. Perhaps there is a difference in the application between C5 and C6, including whether or not it was just recompiled? (If so, can you try a C5 version on the C6 machines?)

If you determine that C6 is doing aggressive caching, does this match the needs of your application? That is, do you have the situation where the client NFS layer does an aggressive read-ahead that is never used by the application?

Are C5 and C6 using the same NFS protocol version? How about TCP vs UDP? If UDP is in play, have a look at fragmentation stats under load.

Are both using the same authentication method (ie: maybe just UID-based)?

And, like always, is DNS sane for all your clients and servers? Everything (including clients) has proper PTR records, consistent with A records, et al? DNS is so fundamental to everything that if it is out of whack you can get far-reaching symptoms that don't seem to have anything to do with DNS.
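Several of these checklist items (protocol version, transport, caching options) can be read straight out of /proc/mounts on each client. A minimal sketch, using a hypothetical filer export line; on a real client, substitute the actual line from /proc/mounts, or just run `nfsstat -m`:

```shell
#!/bin/sh
# Compare NFS mount options across clients. The sample line below is
# hypothetical; on a real box, replace it with:
#   grep ' nfs ' /proc/mounts
line='filer:/vol/data /data nfs rw,vers=3,proto=tcp,rsize=65536,wsize=65536,hard,intr 0 0'

# Field 4 holds the comma-separated mount options; printing one option
# per line, sorted, makes output from two clients directly diff-able.
echo "$line" | awk '{print $4}' | tr ',' '\n' | sort
```

A quick diff of that output between a CentOS 5 and a CentOS 6 client is an easy way to spot a vers= or proto= mismatch.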
<http://wiki.linux-nfs.org> has helpful information about enabling debug output on the client end to see what is going on. I don't know in your situation if enabling server-side debugging is feasible. <http://nfs.sourceforge.net> also has useful tuning information.

You may want to look at NFSometer and see if it can help.

Devin
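The client-side debugging that wiki describes is driven by the rpcdebug tool from nfs-utils. A hedged sketch (the wrapper function name is made up; the module/flag names are standard rpcdebug arguments, but check `man rpcdebug` on your release — this needs root and is extremely verbose):

```shell
#!/bin/sh
# Toggle client-side NFS/RPC kernel debug logging; messages land in
# /var/log/messages via syslog. The wrapper name is hypothetical.
nfs_debug() {
    case "${1:-}" in
        on)  rpcdebug -m nfs -s all && rpcdebug -m rpc -s call xprt ;;
        off) rpcdebug -m nfs -c all && rpcdebug -m rpc -c all ;;
        *)   echo "usage: nfs_debug on|off" >&2; return 64 ;;
    esac
}

# Example (as root): bracket the slow workload with logging on/off.
#   nfs_debug on;  <run workload>;  nfs_debug off
```

Turn it off promptly: on a busy compute node the log volume is substantial.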
James Pearson
2015-Apr-29 15:39 UTC
[CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?
m.roth at 5-cent.us wrote:
> Matt Garman wrote:
>
>> We have a "compute cluster" of about 100 machines that do a read-only
>> NFS mount to a big NAS filer (a NetApp FAS6280). The jobs running on
>> these boxes are analysis/simulation jobs that constantly read data off
>> the NAS.
>
> <snip>
> *IF* I understand you, I've got one question: what parms are you using to
> mount the storage? We had *real* performance problems when we went from 5
> to 6 - as in, unzipping a 26M file to 107M, while writing to an
> NFS-mounted drive, went from 30 sec or so to a *timed* 7 min. The final
> answer was that once we mounted the NFS filesystem with nobarrier in fstab
> instead of default, the time dropped to 35 or 40 sec again.
>
> barrier is in 6, and tries to make writes atomic transactions; its intent
> is to protect in case of things like power failure. Esp. if you're on
> UPSes, nobarrier is the way to go.

The server in this case isn't a Linux box with an ext4 file system - so that won't help ...

James Pearson
Matt Garman
2015-Apr-29 16:32 UTC
[CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?
On Wed, Apr 29, 2015 at 10:36 AM, Devin Reade <gdr at gno.org> wrote:
> Have you looked at the client-side NFS cache? Perhaps the C6 cache
> is either disabled, has fewer resources, or is invalidating faster?
> (I don't think that would explain the C5 starvation, though, unless
> it's a secondary effect from retransmits, etc.)

Do you know where the NFS cache settings are specified? I've looked at the various nfs mount options. Anything cache-related appears to be the same between the two OSes, assuming I didn't miss anything. We did experiment with the "noac" mount option, though that had no effect in our tests.

FWIW, we've done a tcpdump on both OSes, performing the same tasks, and it appears that 5 actually has more "chatter". Just looking at packet counts, 5 has about 17% more packets than 6, for the same workload. I haven't dug too deep into the tcpdump files, since we need a pretty big workload to trigger the measurable performance discrepancy. So the resulting pcap files are on the order of 5 GB.

> Regarding the cache, do you have multiple mount points on a client
> that resolve to the same server filesystem? If so, do they have
> different mount options? If so, that can result in multiple caches
> instead of a single disk cache. The client cache can also be bypassed
> if your application is doing direct I/O on the files. Perhaps there
> is a difference in the application between C5 and C6, including
> whether or not it was just recompiled? (If so, can you try a C5 version
> on the C6 machines?)

No multiple mount points to the same server. No application differences. We're still compiling on 5, regardless of target platform.

> If you determine that C6 is doing aggressive caching, does this match
> the needs of your application? That is, do you have the situation
> where the client NFS layer does an aggressive read-ahead that is never
> used by the application?

That was one of our early theories. On 6, you can adjust this via /sys/class/bdi/X:Y/read_ahead_kb (use stat on the mountpoint to determine X and Y). This file doesn't exist on 5. But we tried increasing and decreasing it from the default (960), and didn't see any changes.

> Are C5 and C6 using the same NFS protocol version? How about TCP vs
> UDP? If UDP is in play, have a look at fragmentation stats under load.

Yup, both are using tcp, protocol version 3.

> Are both using the same authentication method (ie: maybe just
> UID-based)?

Yup, sec=sys.

> And, like always, is DNS sane for all your clients and servers? Everything
> (including clients) has proper PTR records, consistent with A records,
> et al? DNS is so fundamental to everything that if it is out of whack
> you can get far-reaching symptoms that don't seem to have anything to do
> with DNS.

I believe so. I wouldn't bet my life on it. But there were certainly no changes to our DNS before, during or since the OS upgrade.

> You may want to look at NFSometer and see if it can help.

Haven't seen that, will definitely give it a try!

Thanks for your thoughts and suggestions!
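For anyone wanting to repeat the read-ahead experiment, the X:Y in that sysfs path can be derived from stat on the mount point, as described above. A sketch (the helper name is made up; the major/minor split follows the usual Linux/glibc dev_t encoding):

```shell
#!/bin/sh
# Print the BDI read-ahead tunable backing a mount point (CentOS 6+;
# /sys/class/bdi does not exist on CentOS 5). Helper name is hypothetical.
bdi_readahead_file() {
    dev=$(stat -c '%d' "$1")                            # dev_t as one decimal
    major=$(( (dev >> 8) & 0xfff ))                     # glibc major()
    minor=$(( (dev & 0xff) | ((dev >> 12) & 0xfff00) )) # glibc minor()
    echo "/sys/class/bdi/$major:$minor/read_ahead_kb"
}

bdi_readahead_file /      # prints something like /sys/class/bdi/<X>:<Y>/read_ahead_kb
# As root, against the NFS mount point:
#   cat  "$(bdi_readahead_file /mnt/nas)"      # default was 960 in this thread
#   echo 128 > "$(bdi_readahead_file /mnt/nas)"
```

NFS mounts get an anonymous device major of 0, so expect a path like /sys/class/bdi/0:N/... rather than a block-device pair.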
lhecking at users.sourceforge.net
2015-Apr-30 08:47 UTC
[CentOS] nfs (or tcp or scheduler) changes between centos 5 and 6?
Also check out NetApp's performance monitors, e.g. the AutoSupport web site or trusty old filer-mrtg. NFS ops and CPU load might be an indication of things going wrong at the NetApp end - you might be running into particular bugs, and may want to upgrade to the latest patch level of the filer OS.