On Wed, Oct 26, 2016 at 9:35 AM, Matt Garman <matthew.garman at gmail.com> wrote:
> On Tue, Oct 25, 2016 at 7:22 PM, Larry Martell <larry.martell at gmail.com> wrote:
>> Again, no machine on the internal network that my 2 CentOS hosts are
>> on is connected to the internet. I have no way to download anything.
>> There is an onerous and protracted process to get files into the
>> internal network and I will see if I can get netperf in.
>
> Right, but do you have physical access to those machines? Do you have
> physical access to the machine on which you use PuTTY to connect
> to those machines? If yes to either question, then you can use
> another system (that does have Internet access) to download the files
> you want, put them on a USB drive (or burn to a CD, etc), and bring
> the USB/CD to the C6/C7/PuTTY machines.

This site is locked down like no other I have ever seen. You cannot
bring anything into the site - no computers, no media, no phone. You
have to empty your pockets and go through an airport type naked body
scan.

> There's almost always a technical way to get files on to (or out of) a
> system. :) Now, your company might have *policies* that forbid
> skirting around the technical measures that are in place.

This is my client's client, and even if I could circumvent their
policy I would not do that. They have a zero tolerance policy and if
you are caught violating it you are banned for life from the company.
And that would not make my client happy.

> Here's another way you might be able to test network connectivity
> between C6 and C7 without installing new tools: see if both machines
> have "nc" (netcat) installed. I've seen this tool referred to as "the
> swiss army knife of network testing tools", and that is indeed an apt
> description. So if you have that installed, you can hit up the web
> for various examples of its use.
> It's designed to be easily scripted,
> so you can write your own tests, and in theory implement something
> similar to netperf.
>
> OK, I just thought of another "poor man's" way to at least do some
> sanity testing between C6 and C7: scp. First generate a huge file.
> General rule of thumb is at least 2x the amount of RAM in the C7 host.
> You could create a tarball of /usr, for example (e.g. "tar czvf
> /tmp/bigfile.tar.gz /usr" assuming your /tmp partition is big enough
> to hold this). Then, first do this: "time scp /tmp/bigfile.tar.gz
> localhost:/tmp/bigfile_copy.tar.gz". This will literally make a copy
> of that big file, but will route through most of the network stack.
> Make a note of how long it took. And also be sure your /tmp partition
> is big enough for two copies of that big file.
>
> Now, repeat that, but instead of copying to localhost, copy to the C6
> box. Something like: "time scp /tmp/bigfile.tar.gz <IP address of C6
> host>:/tmp/". Does the time reported differ greatly from when you
> copied to localhost? I would expect them to be reasonably close.
> (And this is another reason why you want a fairly large file, so the
> transfer time is dominated by actual file transfer, rather than the
> overhead.)
>
> Lastly, do the reverse test: log in to the C6 box, and copy the file
> back to C7, e.g. "time scp /tmp/bigfile.tar.gz <IP of C7
> host>:/tmp/bigfile_copy2.tar.gz". Again, the time should be
> approximately the same for all three transfers. If either or both of
> the latter two copies take dramatically longer than the first, then
> there's a good chance something is askew with the network config
> between C6 and C7.
>
> Oh... all this time I've been jumping to fancy tests. Have you tried
> the simplest form of testing, that is, doing by hand what your scripts
> do automatically? In other words, simply try copying files between C6
> and C7 using the existing NFS config? Can you manually trigger the
> errors/timeouts you initially posted?
> Is it when copying lots of
> small files? Or when you copy a single huge file? Any kind of file
> copying "profile" you can determine that consistently triggers the
> error? That could be another clue.

These are all good debugging techniques, and I have tried some of
them, but I think the issue is load related. There are 50 external
machines ftp-ing to the C7 server, 24/7, thousands of files a day. And
on the C6 client the script that processes them is running
continuously. It will sometimes run for 7 hours then hang, but it has
run for as long as 3 days before hanging. I have never been able to
reproduce the errors/hanging situation manually.

And again, this is only at this site. We have the same software
deployed at 10 different sites all doing the same thing, and it all
works fine at all of those.
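
The scp comparison quoted above (a localhost copy as the baseline, then C7 to C6, then C6 to C7) could be scripted roughly as follows. This is only a sketch: elapsed_seconds and is_outlier are hypothetical helper names, C6_HOST is a placeholder, and the factor of 3 used for "dramatically longer" is an arbitrary assumption, not a figure from the thread.

```shell
#!/bin/sh
# Run a command and print how many whole seconds it took.
# (Hypothetical helper; relies on "date +%s", available on CentOS.)
elapsed_seconds() {
    es_start=$(date +%s)
    "$@" >/dev/null 2>&1
    es_status=$?
    es_end=$(date +%s)
    echo $((es_end - es_start))
    return "$es_status"
}

# Succeed if SAMPLE is more than FACTOR times BASELINE (all whole seconds).
# The factor is an assumed threshold; choose whatever counts as
# "dramatically longer" for your network.
is_outlier() {
    io_baseline=$1
    io_sample=$2
    io_factor=$3
    [ "$io_sample" -gt $((io_baseline * io_factor)) ]
}

# Example use on C7 (C6_HOST is a placeholder for the C6 address):
#   base=$(elapsed_seconds scp /tmp/bigfile.tar.gz localhost:/tmp/bigfile_copy.tar.gz)
#   remote=$(elapsed_seconds scp /tmp/bigfile.tar.gz "$C6_HOST":/tmp/)
#   is_outlier "$base" "$remote" 3 && echo "C7 -> C6 copy looks suspiciously slow"
```

Whole-second resolution is coarse, which is one more reason to use a large file so that each transfer takes minutes rather than seconds.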
On Thu, Oct 27, 2016 at 1:03 AM, Larry Martell <larry.martell at gmail.com> wrote:
> On Wed, Oct 26, 2016 at 9:35 AM, Matt Garman <matthew.garman at gmail.com> wrote:
>> On Tue, Oct 25, 2016 at 7:22 PM, Larry Martell <larry.martell at gmail.com> wrote:
>>> Again, no machine on the internal network that my 2 CentOS hosts are
>>> on is connected to the internet. I have no way to download anything.
>>> There is an onerous and protracted process to get files into the
>>> internal network and I will see if I can get netperf in.
>>
>> Right, but do you have physical access to those machines? Do you have
>> physical access to the machine on which you use PuTTY to connect
>> to those machines? If yes to either question, then you can use
>> another system (that does have Internet access) to download the files
>> you want, put them on a USB drive (or burn to a CD, etc), and bring
>> the USB/CD to the C6/C7/PuTTY machines.
>
> This site is locked down like no other I have ever seen. You cannot
> bring anything into the site - no computers, no media, no phone. You
> have to empty your pockets and go through an airport type naked body
> scan.
>
>> There's almost always a technical way to get files on to (or out of) a
>> system. :) Now, your company might have *policies* that forbid
>> skirting around the technical measures that are in place.
>
> This is my client's client, and even if I could circumvent their
> policy I would not do that. They have a zero tolerance policy and if
> you are caught violating it you are banned for life from the company.
> And that would not make my client happy.
>
>> Here's another way you might be able to test network connectivity
>> between C6 and C7 without installing new tools: see if both machines
>> have "nc" (netcat) installed. I've seen this tool referred to as "the
>> swiss army knife of network testing tools", and that is indeed an apt
>> description.
>> So if you have that installed, you can hit up the web
>> for various examples of its use. It's designed to be easily scripted,
>> so you can write your own tests, and in theory implement something
>> similar to netperf.
>>
>> OK, I just thought of another "poor man's" way to at least do some
>> sanity testing between C6 and C7: scp. First generate a huge file.
>> General rule of thumb is at least 2x the amount of RAM in the C7 host.
>> You could create a tarball of /usr, for example (e.g. "tar czvf
>> /tmp/bigfile.tar.gz /usr" assuming your /tmp partition is big enough
>> to hold this). Then, first do this: "time scp /tmp/bigfile.tar.gz
>> localhost:/tmp/bigfile_copy.tar.gz". This will literally make a copy
>> of that big file, but will route through most of the network stack.
>> Make a note of how long it took. And also be sure your /tmp partition
>> is big enough for two copies of that big file.
>>
>> Now, repeat that, but instead of copying to localhost, copy to the C6
>> box. Something like: "time scp /tmp/bigfile.tar.gz <IP address of C6
>> host>:/tmp/". Does the time reported differ greatly from when you
>> copied to localhost? I would expect them to be reasonably close.
>> (And this is another reason why you want a fairly large file, so the
>> transfer time is dominated by actual file transfer, rather than the
>> overhead.)
>>
>> Lastly, do the reverse test: log in to the C6 box, and copy the file
>> back to C7, e.g. "time scp /tmp/bigfile.tar.gz <IP of C7
>> host>:/tmp/bigfile_copy2.tar.gz". Again, the time should be
>> approximately the same for all three transfers. If either or both of
>> the latter two copies take dramatically longer than the first, then
>> there's a good chance something is askew with the network config
>> between C6 and C7.
>>
>> Oh... all this time I've been jumping to fancy tests. Have you tried
>> the simplest form of testing, that is, doing by hand what your scripts
>> do automatically?
>> In other words, simply try copying files between C6
>> and C7 using the existing NFS config? Can you manually trigger the
>> errors/timeouts you initially posted? Is it when copying lots of
>> small files? Or when you copy a single huge file? Any kind of file
>> copying "profile" you can determine that consistently triggers the
>> error? That could be another clue.
>
> These are all good debugging techniques, and I have tried some of
> them, but I think the issue is load related. There are 50 external
> machines ftp-ing to the C7 server, 24/7, thousands of files a day. And
> on the C6 client the script that processes them is running
> continuously. It will sometimes run for 7 hours then hang, but it has
> run for as long as 3 days before hanging. I have never been able to
> reproduce the errors/hanging situation manually.
>
> And again, this is only at this site. We have the same software
> deployed at 10 different sites all doing the same thing, and it all
> works fine at all of those.

Well, I spoke too soon. The importer (the one that was initially
hanging that I came here to fix) hung after running for 20 hours. There
were no NFS errors or messages on either the client or the server.
When I restarted it, it hung after 1 minute. I restarted it again and it
hung after 20 seconds. After that, when I restarted it, it hung
immediately. Still no NFS errors or messages. I tried running the
process on the server and it worked fine. So I have to believe this is
related to nobarrier. Tomorrow I will try removing that setting, but I
am no closer to solving this and I have to leave Japan Saturday :-(

The bad disk still has not been replaced - that is supposed to happen
tomorrow, but I won't have enough time after that to draw any
conclusions.
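
Before and after removing the nobarrier setting, it may be worth confirming which mounted filesystems actually carry that option. A minimal sketch, assuming the Linux /proc/mounts layout (device, mount point, type, comma-separated options); mounts_with_option is a hypothetical helper name, not something from the thread.

```shell
#!/bin/sh
# Print the mount point of every filesystem whose option list contains
# the given option. Reads /proc/mounts unless another file in the same
# format is passed as the second argument.
mounts_with_option() {
    mwo_wanted=$1
    mwo_file=${2:-/proc/mounts}
    awk -v want="$mwo_wanted" '{
        # Field 4 holds the comma-separated mount options.
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == want) { print $2; break }
    }' "$mwo_file"
}

# Example use on either host:
#   mounts_with_option nobarrier
```

An empty result after the remount would suggest the option is really gone, rather than just removed from /etc/fstab.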
On Thu, Oct 27, 2016 at 12:03 AM, Larry Martell <larry.martell at gmail.com> wrote:
> This site is locked down like no other I have ever seen. You cannot
> bring anything into the site - no computers, no media, no phone. You
> ...
> This is my client's client, and even if I could circumvent their
> policy I would not do that. They have a zero tolerance policy and if
> ...

OK, no internet for real. :) Sorry I kept pushing this. I made an
unflattering assumption that maybe it just hadn't occurred to you how
to get files in or out. Sometimes there are "soft" barriers to bringing
files in or out: they don't want it to be trivial, but want it to be
doable if necessary. But then there are times when they really mean it.
I thought maybe the former applied to you, but clearly it's the latter.
Apologies.

> These are all good debugging techniques, and I have tried some of
> them, but I think the issue is load related. There are 50 external
> machines ftp-ing to the C7 server, 24/7, thousands of files a day. And
> on the C6 client the script that processes them is running
> continuously. It will sometimes run for 7 hours then hang, but it has
> run for as long as 3 days before hanging. I have never been able to
> reproduce the errors/hanging situation manually.

If it truly is load related, I'd think you'd see something askew in the
sar logs. But if the load tends to spike, rather than be continuous,
the sar sampling rate may be too coarse to pick it up.

> And again, this is only at this site. We have the same software
> deployed at 10 different sites all doing the same thing, and it all
> works fine at all of those.

Flaky hardware can also cause weird intermittent issues. I know you
mentioned before your hardware is fairly new/decent spec; but that
doesn't make it immune to manufacturing defects. For example, imagine
one voltage regulator that's ever-so-slightly out of spec. It happens.
Bad memory is not uncommon and certainly causes all kinds of mysterious
issues (though in my experience that tends to result in spontaneous
reboots or hard lockups, but truly anything could happen). Ideally, you
could take the system offline and run hardware diagnostics, but I
suspect that's impossible given your restrictions on taking things
in/out of the datacenter.

On Thu, Oct 27, 2016 at 3:05 AM, Larry Martell <larry.martell at gmail.com> wrote:
> Well, I spoke too soon. The importer (the one that was initially
> hanging that I came here to fix) hung after running for 20 hours. There
> were no NFS errors or messages on either the client or the server.
> When I restarted it, it hung after 1 minute. I restarted it again and it
> hung after 20 seconds. After that, when I restarted it, it hung
> immediately. Still no NFS errors or messages. I tried running the
> process on the server and it worked fine. So I have to believe this is
> related to nobarrier. Tomorrow I will try removing that setting, but I
> am no closer to solving this and I have to leave Japan Saturday :-(
>
> The bad disk still has not been replaced - that is supposed to happen
> tomorrow, but I won't have enough time after that to draw any
> conclusions.

I've seen behavior like that with disks that are on their way out...
basically the system wants to read a block of data, and the disk
doesn't read it successfully, so it keeps trying. The kind of disk,
what kind of controller it's behind, raid level, and various other
settings can all impact this phenomenon, and also how much detail you
can see about it. You already know you have one bad disk, so that's
kind of an open wound that may or may not be contributing to your
bigger, unsolved problem.

So that makes me think, you can also do some basic disk benchmarking.
iozone and bonnie++ are nice, but I'm guessing they're not installed
and you don't have a means to install them.
But you can use "dd" to do some basic benchmarking, and that's all but
guaranteed to be installed. Similar to network benchmarking, you can do
something like:

    time dd if=/dev/zero of=/tmp/testfile.dat bs=1G count=256

That will generate a 256 GB file. Adjust "bs" and "count" to whatever
makes sense. General rule of thumb is you want the target file to be at
least 2x the amount of RAM in the system, to keep cache effects from
skewing your results. Bigger is even better if you have the space, as
it increases the odds of hitting the "bad" part of the disk (if indeed
that's the source of your problem).

Do that on C6 and C7, and ideally also on a similar machine as a
"control" box. Again, we're looking for outliers, hang-ups, timeouts,
etc.

+1 to Gordon's suggestion to sanity check MTU sizes.

Another random possibility... By somewhat funny coincidence, we have
some servers in Japan as well, and were recently banging our heads
against the wall with some weird networking issues. The remote hands we
had helping us (none of our staff was on site) claimed one or more
fiber cables were dusty, enough that it was affecting light levels.
They cleaned the cables and the problems went away. Anyway, if you have
access to the switches, you should be able to check that light levels
are within spec.

If you have the ability to take these systems offline temporarily, you
can also run "fsck" (file system check) on the C6 and C7 file systems.
IIRC, ext4 can do a very basic kind of check on a mounted filesystem.
But a deeper/more comprehensive scan requires the FS to be unmounted.
Not sure what the rules are for xfs. But C6 uses ext4 by default, so
you could probably at least run the basic check on that without taking
the system offline.
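
The dd write test described above could be wrapped in a small function so the same run is easy to repeat on C6, C7 and a control box. This is a sketch under assumptions: dd_write_test is a hypothetical name, the block size is given in bytes for portability, and the sync call is an addition that is not in the thread (it makes the timing include flushing dirty pages, and on a busy server it will also wait for other pending writes).

```shell
#!/bin/sh
# Write COUNT blocks of BLOCK_BYTES zero bytes to TARGET, flush to disk,
# and print the total size, elapsed time and a rough throughput figure.
dd_write_test() {
    dw_target=$1
    dw_block_bytes=$2
    dw_count=$3
    dw_start=$(date +%s)
    dd if=/dev/zero of="$dw_target" bs="$dw_block_bytes" count="$dw_count" \
        2>/dev/null || return 1
    sync    # assumption: include the flush to disk in the measured time
    dw_end=$(date +%s)
    dw_elapsed=$((dw_end - dw_start))
    # Avoid dividing by zero on very short runs.
    [ "$dw_elapsed" -gt 0 ] || dw_elapsed=1
    dw_total=$((dw_block_bytes * dw_count))
    echo "$dw_total bytes in ${dw_elapsed}s, about $((dw_total / dw_elapsed / 1048576)) MiB/s"
}

# Example: 1 MiB blocks; pick a count that makes the file at least
# 2x the RAM of the host, then delete the file afterwards.
#   dd_write_test /tmp/testfile.dat 1048576 262144   # 256 GiB
#   rm -f /tmp/testfile.dat
```

A run that stalls, or a throughput figure far below the control box, would be the kind of outlier worth chasing.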