On Thu, Oct 27, 2016 at 12:03 AM, Larry Martell <larry.martell at gmail.com> wrote:
> This site is locked down like no other I have ever seen. You cannot
> bring anything into the site - no computers, no media, no phone. You
> ...
> This is my client's client, and even if I could circumvent their
> policy I would not do that. They have a zero tolerance policy and if
> ...

OK, no internet for real. :)  Sorry I kept pushing this. I made an
unflattering assumption that maybe it just hadn't occurred to you how
to get files in or out. Sometimes there are "soft" barriers to
bringing files in or out: they don't want it to be trivial, but want
it to be doable if necessary. But then there are times when they
really mean it. I thought maybe the former applied to you, but
clearly it's the latter. Apologies.

> These are all good debugging techniques, and I have tried some of
> them, but I think the issue is load related. There are 50 external
> machines ftp-ing to the C7 server, 24/7, thousands of files a day. And
> on the C6 client the script that processes them is running
> continuously. It will sometimes run for 7 hours then hang, but it has
> run for as long as 3 days before hanging. I have never been able to
> reproduce the errors/hanging situation manually.

If it truly is load related, I'd think you'd see something askew in
the sar logs. But if the load tends to spike, rather than be
continuous, the sar sampling rate may be too coarse to pick it up.

> And again, this is only at this site. We have the same software
> deployed at 10 different sites all doing the same thing, and it all
> works fine at all of those.

Flaky hardware can also cause weird intermittent issues. I know you
mentioned before that your hardware is fairly new and decent spec, but
that doesn't make it immune to manufacturing defects. For example,
imagine one voltage regulator that's ever-so-slightly out of spec. It
happens.
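(A footnote on the sar point: the stock sysstat cron job only samples
every 10 minutes, so a short load spike can vanish between samples. If
you can't reconfigure sysstat on that box, a throwaway loop over
/proc/loadavg gives you 1-second resolution; the iteration count and
output handling below are purely illustrative:)

```shell
# Sample the 1/5/15-minute load averages once per second.
# In practice, redirect this to a log file and leave it running
# until the next hang; three iterations here are just for show.
i=0
while [ "$i" -lt 3 ]; do
    printf '%s %s\n' "$(date +%H:%M:%S)" "$(cat /proc/loadavg)"
    sleep 1
    i=$((i + 1))
done
```

Correlating timestamps in that log against the hang time at least tells
you whether load really does spike right before the importer wedges.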
Bad memory is not uncommon and certainly causes all kinds of mysterious
issues (though in my experience that tends to result in spontaneous
reboots or hard lockups, but truly anything could happen).

Ideally, you could take the system offline and run hardware
diagnostics, but I suspect that's impossible given your restrictions on
taking things in/out of the datacenter.

On Thu, Oct 27, 2016 at 3:05 AM, Larry Martell <larry.martell at gmail.com> wrote:
> Well I spoke too soon. The importer (the one that was initially
> hanging that I came here to fix) hung up after running 20 hours. There
> were no NFS errors or messages on either the client or the server.
> When I restarted it, it hung after 1 minute. Restarted it again and it
> hung after 20 seconds. After that, when I restarted it, it hung
> immediately. Still no NFS errors or messages. I tried running the
> process on the server and it worked fine. So I have to believe this is
> related to nobarrier. Tomorrow I will try removing that setting, but I
> am no closer to solving this and I have to leave Japan Saturday :-(
>
> The bad disk still has not been replaced - that is supposed to happen
> tomorrow, but I won't have enough time after that to draw any
> conclusions.

I've seen behavior like that with disks that are on their way out...
basically the system wants to read a block of data, and the disk
doesn't read it successfully, so it keeps retrying. The kind of disk,
what kind of controller it's behind, the RAID level, and various other
settings can all impact this phenomenon, and also how much detail you
can see about it. You already know you have one bad disk, so that's
kind of an open wound that may or may not be contributing to your
bigger, unsolved problem.

So that makes me think: you can also do some basic disk benchmarking.
iozone and bonnie++ are nice, but I'm guessing they're not installed
and you don't have a means to install them.
But you can use "dd" to do some basic benchmarking, and that's all but
guaranteed to be installed. Similar to network benchmarking, you can do
something like:

    time dd if=/dev/zero of=/tmp/testfile.dat bs=1G count=256

That will generate a 256 GB file. Adjust "bs" and "count" to whatever
makes sense. The general rule of thumb is that you want the target file
to be at least 2x the amount of RAM in the system, to keep cache
effects from skewing your results. Bigger is even better if you have
the space, as it increases the odds of hitting the "bad" part of the
disk (if indeed that's the source of your problem).

Do that on C6 and C7, and if you can include a similar machine as a
"control" box, that would be ideal. Again, we're looking for outliers,
hang-ups, timeouts, etc.

+1 to Gordon's suggestion to sanity check MTU sizes.

Another random possibility... by somewhat funny coincidence, we have
some servers in Japan as well, and were recently banging our heads
against the wall over some weird networking issues. The remote hands
helping us (none of our staff was on site) claimed one or more fiber
cables were dusty, enough that it was affecting light levels. They
cleaned the cables and the problems went away. Anyway, if you have
access to the switches, you should be able to check that light levels
are within spec.

If you have the ability to take these systems offline temporarily, you
can also run "fsck" (file system check) on the C6 and C7 file systems.
IIRC, ext4 can do a very basic kind of check on a mounted filesystem,
but a deeper/more comprehensive scan requires the FS to be unmounted.
Not sure what the rules are for xfs. But C6 uses ext4 by default, so
you could probably at least run the basic check there without taking
the system offline.
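To make that dd comparison across C6, C7, and a control box repeatable,
you could wrap it in a tiny script along these lines. The target path
and sizes below are deliberately tiny placeholders; on the real hosts
you'd use something like bs=1G and a count big enough to hit the
2x-RAM rule:

```shell
#!/bin/sh
# Sequential-write timing sketch. conv=fdatasync forces the data to
# disk before dd exits, so the page cache doesn't flatter the result.
target=/tmp/ddtest.$$        # hypothetical path; point at the suspect fs
bs=1M                        # bump to 1G on the real systems
count=8                      # bump so bs*count >= 2x RAM
time dd if=/dev/zero of="$target" bs="$bs" count="$count" conv=fdatasync
wc -c < "$target"            # sanity check: should equal bs * count bytes
rm -f "$target"
```

Run it several times on each box and watch for a run that takes wildly
longer than the rest - that's the outlier/hang signature to look for.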
On 27/10/16 21:23, Matt Garman wrote:
<snip>
> If you have the ability to take these systems offline temporarily, you
> can also run "fsck" (file system check) on the C6 and C7 file systems.
> IIRC, ext4 can do a very basic kind of check on a mounted filesystem.
> But a deeper/more comprehensive scan requires the FS to be unmounted.
> Not sure what the rules are for xfs. But C6 uses ext4 by default so
> you could probably at least run the basic check on that without taking
> the system offline.

Don't bother with fsck on XFS filesystems. From the man page
[fsck.xfs(8)]: "XFS is a journaling filesystem and performs recovery at
mount(8) time if necessary, so fsck.xfs simply exits with a zero exit
status".

If you need a deeper examination, use xfs_repair(8), and note that "the
filesystem to be repaired must be unmounted, otherwise, the resulting
filesystem may be inconsistent or corrupt" (also from the man page).
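If a maintenance window ever materializes, the non-destructive way to
inspect the XFS volume is xfs_repair's no-modify mode. This is only a
sketch - the mount point and device name below are hypothetical, and it
needs root and an unmounted filesystem:

```shell
# xfs_repair requires the filesystem to be unmounted first.
umount /data                 # hypothetical mount point
# -n = no-modify mode: scan and report problems, change nothing.
xfs_repair -n /dev/sdb1      # hypothetical device
# Only if the -n pass reports damage, run a real repair pass:
#   xfs_repair /dev/sdb1
mount /data
```

The -n pass is safe to run during a window even if you never intend to
repair in place, since it tells you whether the on-disk state is clean.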
Matt Garman wrote:
> On Thu, Oct 27, 2016 at 12:03 AM, Larry Martell <larry.martell at gmail.com>
> wrote:
<snip>
> On Thu, Oct 27, 2016 at 3:05 AM, Larry Martell <larry.martell at gmail.com>
> wrote:
>> Well I spoke too soon. The importer (the one that was initially
>> hanging that I came here to fix) hung up after running 20 hours. There
>> were no NFS errors or messages on either the client or the server.
>> When I restarted it, it hung after 1 minute. Restarted it again and it
>> hung after 20 seconds. After that, when I restarted it, it hung
>> immediately. Still no NFS errors or messages. I tried running the
>> process on the server and it worked fine. So I have to believe this is
>> related to nobarrier. Tomorrow I will try removing that setting, but I
>> am no closer to solving this and I have to leave Japan Saturday :-(
>>
>> The bad disk still has not been replaced - that is supposed to happen
>> tomorrow, but I won't have enough time after that to draw any
>> conclusions.
>
> I've seen behavior like that with disks that are on their way out...
<snip>

I just had a truly unpleasant thought, speaking of disks. Years ago, we
tried some WD Green drives in our servers, and that was a disaster.
Somewhere between days and weeks in, the drives would go offline. I
finally found out what happened: consumer-grade drives are intended for
desktops, and their TLER - how long the drive keeps trying to read or
write a sector before giving up, marking the sector bad, and going
somewhere else - is two *minutes*. Our servers expected the TLER to be
7 *seconds* or under. Any chance the client cheaped out on any of the
drives?

mark
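For what it's worth, on many drives you can actually query (and
sometimes set) that give-up timeout from the OS: smartmontools exposes
it as SCT Error Recovery Control. A quick check like the one below
tells you whether a drive is running desktop firmware with a
multi-minute timeout. The device name is hypothetical, it needs root,
and not all drives support SCT ERC:

```shell
# Show the current SCT ERC read/write timeouts
# (reported in units of tenths of a second).
smartctl -l scterc /dev/sda
# On drives that allow it, set 7-second timeouts (70 deciseconds),
# the usual value for drives sitting behind RAID:
smartctl -l scterc,70,70 /dev/sda
```

Note the setting is typically lost on power cycle, so server setups
that rely on it reapply it from a boot script.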
On Thu, Oct 27, 2016 at 4:23 PM, Matt Garman <matthew.garman at gmail.com> wrote:
> On Thu, Oct 27, 2016 at 12:03 AM, Larry Martell <larry.martell at gmail.com> wrote:
>> This site is locked down like no other I have ever seen. You cannot
>> bring anything into the site - no computers, no media, no phone. You
>> ...
>> This is my client's client, and even if I could circumvent their
>> policy I would not do that. They have a zero tolerance policy and if
>> ...
>
> OK, no internet for real. :)  Sorry I kept pushing this. I made an
> unflattering assumption that maybe it just hadn't occurred to you how
> to get files in or out. Sometimes there are "soft" barriers to
> bringing files in or out: they don't want it to be trivial, but want
> it to be doable if necessary. But then there are times when they
> really mean it. I thought maybe the former applied to you, but
> clearly it's the latter. Apologies.
>
>> These are all good debugging techniques, and I have tried some of
>> them, but I think the issue is load related. There are 50 external
>> machines ftp-ing to the C7 server, 24/7, thousands of files a day. And
>> on the C6 client the script that processes them is running
>> continuously. It will sometimes run for 7 hours then hang, but it has
>> run for as long as 3 days before hanging. I have never been able to
>> reproduce the errors/hanging situation manually.
>
> If it truly is load related, I'd think you'd see something askew in
> the sar logs. But if the load tends to spike, rather than be
> continuous, the sar sampling rate may be too coarse to pick it up.
>
>> And again, this is only at this site. We have the same software
>> deployed at 10 different sites all doing the same thing, and it all
>> works fine at all of those.
>
> Flaky hardware can also cause weird intermittent issues. I know you
> mentioned before your hardware is fairly new/decent spec; but that
> doesn't make it immune to manufacturing defects.
> For example, imagine one voltage regulator that's ever-so-slightly
> out of spec. It happens. Bad memory is not uncommon and certainly
> causes all kinds of mysterious issues (though in my experience that
> tends to result in spontaneous reboots or hard lockups, but truly
> anything could happen).
>
> Ideally, you could take the system offline and run hardware
> diagnostics, but I suspect that's impossible given your restrictions
> on taking things in/out of the datacenter.
>
> On Thu, Oct 27, 2016 at 3:05 AM, Larry Martell <larry.martell at gmail.com> wrote:
>> Well I spoke too soon. The importer (the one that was initially
>> hanging that I came here to fix) hung up after running 20 hours. There
>> were no NFS errors or messages on either the client or the server.
>> When I restarted it, it hung after 1 minute. Restarted it again and it
>> hung after 20 seconds. After that, when I restarted it, it hung
>> immediately. Still no NFS errors or messages. I tried running the
>> process on the server and it worked fine. So I have to believe this is
>> related to nobarrier. Tomorrow I will try removing that setting, but I
>> am no closer to solving this and I have to leave Japan Saturday :-(
>>
>> The bad disk still has not been replaced - that is supposed to happen
>> tomorrow, but I won't have enough time after that to draw any
>> conclusions.
>
> I've seen behavior like that with disks that are on their way out...
> basically the system wants to read a block of data, and the disk
> doesn't read it successfully, so it keeps trying. The kind of disk,
> what kind of controller it's behind, raid level, and various other
> settings can all impact this phenomenon, and also how much detail you
> can see about it.
> You already know you have one bad disk, so that's kind of an open
> wound that may or may not be contributing to your bigger, unsolved
> problem.

Just replaced the disk, but I am leaving tomorrow, so it was decided
that we will run the process on the C7 server, at least for now. I will
probably have to come back here early next year and revisit this. We
are thinking of building a new system back in NY, shipping it here, and
swapping them out.

> So that makes me think, you can also do some basic disk benchmarking.
> iozone and bonnie++ are nice, but I'm guessing they're not installed
> and you don't have a means to install them. But you can use "dd" to
> do some basic benchmarking, and that's all but guaranteed to be
> installed. Similar to network benchmarking, you can do something
> like:
>
>     time dd if=/dev/zero of=/tmp/testfile.dat bs=1G count=256
>
> That will generate a 256 GB file. Adjust "bs" and "count" to whatever
> makes sense. General rule of thumb is you want the target file to be
> at least 2x the amount of RAM in the system to avoid cache effects
> from skewing your results. Bigger is even better if you have the
> space, as it increases the odds of hitting the "bad" part of the disk
> (if indeed that's the source of your problem).
>
> Do that on C6, C7, and if you can a similar machine as a "control"
> box, it would be ideal. Again, we're looking for outliers, hang-ups,
> timeouts, etc.
>
> +1 to Gordon's suggestion to sanity check MTU sizes.
>
> Another random possibility... By somewhat funny coincidence, we have
> some servers in Japan as well, and were recently banging our heads
> against the wall with some weird networking issues. The remote hands
> we had helping us (none of our staff was on site) claimed one or more
> fiber cables were dusty, enough that it was affecting light levels.
> They cleaned the cables and the problems went away.
> Anyway, if you have access to the switches, you should be able to
> check that light levels are within spec.

No switches - it's an internal, virtual network between the server and
the virtualized client.

> If you have the ability to take these systems offline temporarily, you
> can also run "fsck" (file system check) on the C6 and C7 file systems.
> IIRC, ext4 can do a very basic kind of check on a mounted filesystem.
> But a deeper/more comprehensive scan requires the FS to be unmounted.
> Not sure what the rules are for xfs. But C6 uses ext4 by default so
> you could probably at least run the basic check on that without taking
> the system offline.

The systems were rebooted 2 days ago, and fsck was run at boot time and
came back clean.

Thanks for all your help with trying to solve this. It was a very
frustrating time for me - I was here for 10 days and did not really
discover anything about the problem. Hopefully running the process on
the server will keep it going and keep the customer happy. I will
update this thread when/if I revisit it.
On Thu, Oct 27, 2016 at 5:16 PM, <m.roth at 5-cent.us> wrote:
> Matt Garman wrote:
>> On Thu, Oct 27, 2016 at 12:03 AM, Larry Martell <larry.martell at gmail.com>
>> wrote:
> <snip>
>> On Thu, Oct 27, 2016 at 3:05 AM, Larry Martell <larry.martell at gmail.com>
>> wrote:
>>> Well I spoke too soon. The importer (the one that was initially
>>> hanging that I came here to fix) hung up after running 20 hours. There
>>> were no NFS errors or messages on either the client or the server.
>>> When I restarted it, it hung after 1 minute. Restarted it again and it
>>> hung after 20 seconds. After that, when I restarted it, it hung
>>> immediately. Still no NFS errors or messages. I tried running the
>>> process on the server and it worked fine. So I have to believe this is
>>> related to nobarrier. Tomorrow I will try removing that setting, but I
>>> am no closer to solving this and I have to leave Japan Saturday :-(
>>>
>>> The bad disk still has not been replaced - that is supposed to happen
>>> tomorrow, but I won't have enough time after that to draw any
>>> conclusions.
>>
>> I've seen behavior like that with disks that are on their way out...
> <snip>
> I just had a truly unpleasant thought, speaking of disks. Years ago, we
> tried some WD Green drives in our servers, and that was a disaster.
> Somewhere between days and weeks in, the drives would go offline. I
> finally found out what happened: consumer-grade drives are intended
> for desktops, and their TLER - how long the drive keeps trying to read
> or write a sector before giving up, marking the sector bad, and going
> somewhere else - is two *minutes*. Our servers expected the TLER to be
> 7 *seconds* or under. Any chance the client cheaped out on any of the
> drives?

No, it's a fairly high-end Lenovo X series server (x3650, I think).