On Mon, Oct 24, 2016 at 1:32 PM, Matt Garman <matthew.garman at gmail.com> wrote:
> On Sun, Oct 23, 2016 at 8:02 AM, Larry Martell <larry.martell at gmail.com> wrote:
>>> To be clear: the python script is moving files on the same NFS file
>>> system? E.g., something like
>>>
>>>     mv /mnt/nfs-server/dir1/file /mnt/nfs-server/dir2/file
>>>
>>> where /mnt/nfs-server is the mount point of the NFS server on the
>>> client machine?
>>
>> Correct.
>>
>>> Or are you moving files from the CentOS 7 NFS server to the CentOS 6 NFS client?
>>
>> No, the files are FTP-ed to the CentOS 7 NFS server and then processed
>> and moved on the CentOS 6 NFS client.
>
>
> I apologize if I'm being dense here, but I'm more confused on this
> data flow now. Your use of "correct" and "no" seems to be
> inconsistent with your explanation. Sorry!

I thought you were asking "Are you doing: A: moving files on the same
NFS filesystem, or B: moving them across filesystems?" And I replied,
"Correct, I am doing A; no, I am not doing B." The script moves the
files from /mnt/nfs-server/dir1/file to /mnt/nfs-server/dir2/file.

> At any rate, what I was looking at was seeing if there was any way to
> simplify this process, and cut NFS out of the picture. If you need
> only to push these files around, what about rsync?

It's not just moving files around. The files are read, and their
contents are loaded into a MySQL database.

>> The problem doing that is the files are processed and loaded to MySQL
>> and then moved by a script that uses the Django ORM, and neither
>> django, nor any of the other python packages needed are installed on
>> the server. And since the server does not have an external internet
>> connection (as I mentioned in my reply to Mark) getting it set up
>> would require a large amount of effort.
>
> ...right, but I'm pretty sure rsync should be installed on the server;
> I believe it's default in all except the "minimal" setup profiles.
> Either way, it's trivial to install, as I don't think it has any
> dependencies. You can download the rsync rpm from mirror.centos.org,
> then scp it to the server, then install via yum. And Python is
> definitely installed (requirement for yum) and Perl is probably
> installed as well, so with rsync plus some basic Perl/Python scripting
> you can create your own mover script.
>
> Actually, rsync may not even be necessary, scp may be sufficient for
> your purposes. And scp should definitely be installed.

This site is not in any way connected to the internet, and you cannot
bring in any computers, phones, or media of any kind. There is a
process to get machines or files in, but it is onerous and time
consuming. This system was set up and configured off site and then
brought on site.

To run the script on the C7 NFS server instead of the C6 NFS client
many python libs will have to be installed. I do have someone off site
working on setting up a local yum repo with what I need, and then we
are going to see if we can zip and email the repo and get it on site.
But none of us are sys admins and we don't really know what we're
doing so we may not succeed and it may take longer than I will be here
in Japan (I am scheduled to leave Saturday).

>> Also, we have this exact same setup on over 10 other systems, and it
>> is only this one that is having a problem. The one difference with
>> this one is that the server is CentOS7 - on all the other systems both
>> the NFS server and client are CentOS6.
>
> From what you've described so far, with what appears to be a
> relatively simple config, C6 or C7 "shouldn't" matter. However, under
> the hood, C6 and C7 are quite different.
>
>> The python script checks the modification time of the file, and only
>> if it has not been modified in more than 2 minutes does it process it.
>> Otherwise it skips it and waits for the next run to potentially
>> process it. Also, the script can tell if the file is incomplete in a
>> few different ways. So if it has not been modified in more than 2
>> minutes, the script starts to process it, but if it finds that it's
>> incomplete it aborts the processing and leaves it for next time.
>
> This script runs on C7 or C6?

C6

>
>> The hardware is new, and is in a rack in a server room with adequate
>> and monitored cooling and power. But I just found out from someone on
>> site that there is a disk failure, which happened back on Sept 3. The
>> system uses RAID, but I don't know what level. I was told it can
>> tolerate 3 disk failures and still keep working, but personally, I
>> think all bets are off until the disk has been replaced. That should
>> happen in the next day or 2, so we shall see.
>
> OK, depending on the RAID scheme and how it's implemented, there could
> be disk timeouts causing things to hang.

Yes, that's why when I found out about the disk failure I wanted to
hold off doing anything until the disk gets replaced. But as that is
not happening until Wednesday afternoon I think I want to try Mark's
nobarrier config option today.

>> I've been watching and monitoring the machines for 2 days and neither
>> one has had a large CPU load, nor has been using much memory.
>
> How about iostat? Also, good old "dmesg" can suggest if the system
> with the failed drive is causing timeouts to occur.

Nothing in dmesg or /var/log/messages about the failed disk at all. I
only saw that when I got on the Integrated Management Module console.
But the logs only go back to Sep 21 and the disk failed on Sep 3. The
logs only have the NFS errors, no other errors.

>
>
>> None on the client. On the server it has 1 dropped Rx packet.
>>
>>> Do "ethtool <interface>" on both machines to make sure both are
>>> linked up at the correct speed and duplex.
>>
>> That reports only "Link detected: yes" for both client and server.
>
> OK, but ethtool should also say something like:
>
>     ...
>     Speed: 1000Mb/s
>     Duplex: Full
>     ...

No, it outputs just the one line:

    Link detected: yes

> For a 1gbps network. If Duplex is reported as "half", then that is
> definitely a problem. Using netperf is further confirmation of
> whether or not your network is functioning as expected.
>
>
>> sar seems to be running, but I can only get it to report on the
>> current day. The man page shows start and end time options, but is
>> there a way to specify the start and end date?
>
> If you want to report on a day in the past, you have to pass the file
> argument, something like this:
>
>     sar -A -f /var/log/sa/sa23 -s 07:00:00 -e 08:00:00
>
> That would show you yesterday's data between 7am and 8am. The files
> in /var/log/sa/saXX are the files that correspond to the day. By
> default, XX will be the day of the month.

OK, Thanks.

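For readers following along, the "not modified in more than 2 minutes"
gate described in this message can be sketched roughly as below. This
is only an illustration of the idea, not the actual script: the names
is_quiet, files_ready and QUIET_SECONDS are made up here, Python 3 is
assumed, and the script's separate completeness checks are not shown.

```python
import os
import time

# Assumed from "more than 2 minutes" in the thread; not taken from the real script.
QUIET_SECONDS = 120


def is_quiet(path, now=None, quiet_seconds=QUIET_SECONDS):
    """Return True if path has not been modified for more than quiet_seconds."""
    if now is None:
        now = time.time()
    return (now - os.stat(path).st_mtime) > quiet_seconds


def files_ready(directory, now=None):
    """List regular files in directory that have passed the quiet period."""
    ready = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and is_quiet(path, now=now):
            ready.append(path)
    return ready
```

One caveat with this kind of check over NFS: the mtime is stamped by
the server and the client may be looking at a cached attribute, so
clock skew between the two machines can shift the effective window.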
On Mon, Oct 24, 2016 at 2:42 PM, Larry Martell <larry.martell at gmail.com> wrote:
>> At any rate, what I was looking at was seeing if there was any way to
>> simplify this process, and cut NFS out of the picture. If you need
>> only to push these files around, what about rsync?
>
> It's not just moving files around. The files are read, and their
> contents are loaded into a MySQL database.

On what server does the MySQL database live?

> This site is not in any way connected to the internet, and you cannot
> bring in any computers, phones, or media of any kind. There is a
> process to get machines or files in, but it is onerous and time
> consuming. This system was set up and configured off site and then
> brought on site.

But clearly you have a means to log in to both the C6 and C7 servers,
right? Otherwise, how would you be able to see these errors, check
top/sar/free/iostat/etc?

And if you are logging in to both of these boxes, I assume you are
doing so via ssh?

Or are you actually physically sitting in front of these machines?

If you have ssh access to these machines, then you can trivially copy
files to/from them. If ssh is installed and working, then scp should
also be installed and working. Even if you don't have scp, you can
use tar over ssh to the same effect. It's ugly, but doable, and there
are examples online for how to do it.

Also: you made a couple comments about these machines, it looks like
the C7 box (FTP server + NFS server) is running bare metal (i.e. not a
virtual machine). The C6 instance (NFS client) is virtualized.

What host is the C6 instance?

Is the C6 instance running under the C7 instance? I.e., are both
machines on the same physical hardware? If that is true, then your
"network" (at least the one between C7 and C6) is basically virtual,
and to have issues like this on the same physical box is certainly
indicative of a mis-configuration.

> To run the script on the C7 NFS server instead of the C6 NFS client
> many python libs will have to be installed. I do have someone off site
> working on setting up a local yum repo with what I need, and then we
> are going to see if we can zip and email the repo and get it on site.
> But none of us are sys admins and we don't really know what we're
> doing so we may not succeed and it may take longer than I will be here
> in Japan (I am scheduled to leave Saturday).

Right, but my point is you can write your own custom script(s) to copy
files from C7 to C6 (based on rsync or ssh), do the processing on C6
(DB loading, whatever other processing), then move back to C7 if
necessary. You said yourself you are a programmer not a sysadmin, so
change the nature of the problem from a sysadmin problem to a
programming problem.

I'm certain I'm missing something, but the fundamental architecture
doesn't make sense to me given what I understand of the process flow.

Were you able to run some basic network testing tools between the C6
and C7 machines? I'm interested specifically in netperf, which does
round trip packet testing, both TCP and UDP. I would look for packet
drops with UDP, and/or major performance outliers with TCP, and/or any
kind of timeouts with either protocol.

How is name resolution working on both machines? Do you address
machines by hostname (e.g., "my_c6_server"), or explicitly by IP
address? Are you using DNS or are the IPs hard-coded in /etc/hosts?

To me it still "smells" like a networking issue...

-Matt

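The "tar over ssh" idea mentioned above might look something like the
following when driven from Python. Treat it as a rough sketch under
stated assumptions: the function names, the host name and the
directories are hypothetical, Python 3 is assumed, and it presumes
key-based ssh login already works between the two machines.

```python
import shlex
import subprocess


def tar_over_ssh_commands(src_dir, host, dst_dir):
    """Build the two halves of: tar -C src -cf - . | ssh host 'tar -C dst -xf -'"""
    pack = ["tar", "-C", src_dir, "-cf", "-", "."]
    # The remote side is run by a shell as one string, so quote the directory.
    unpack = ["ssh", host, "tar -C {} -xf -".format(shlex.quote(dst_dir))]
    return pack, unpack


def copy_tree(src_dir, host, dst_dir):
    """Stream src_dir to dst_dir on host; return True if both ends exit 0."""
    pack, unpack = tar_over_ssh_commands(src_dir, host, dst_dir)
    packer = subprocess.Popen(pack, stdout=subprocess.PIPE)
    receiver = subprocess.Popen(unpack, stdin=packer.stdout)
    packer.stdout.close()  # lets tar receive SIGPIPE if ssh exits early
    receiver_rc = receiver.wait()
    packer_rc = packer.wait()
    return packer_rc == 0 and receiver_rc == 0
```

A call such as copy_tree("/srv/ftp/incoming", "c6-host", "/data/incoming")
(all three values invented for illustration) would stream the directory
across without NFS being involved in the transfer.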
Another alternative idea: you probably won't be comfortable with this,
but check out systemd-nspawn.  There are lots of examples online, and
even I wrote about how I use it:
    http://raw-sewage.net/articles/fedora-under-centos/
This is unfortunately another "sysadmin" solution to your problem.
nspawn is the successor to chroot, if you are at all familiar with
that.  It's kinda-sorta like running a system-within-a-system, but
much more lightweight.  The "slave" systems share the running kernel
with the "master" system.  (I could say the "guest" and "host"
systems, but those are virtual machine terms, and this is not a
virtual machine.)  For your particular case, the main benefit is that
you can natively share filesystems, rather than use NFS to share
files.
So, it's clear you have network capability between the C6 and C7
systems.  And surely you must have ssh installed on both systems.
Therefore, you can transfer files between C6 and C7.  So here's a way
you can use systemd-nspawn to get around trying to install all the
extra libs you need on C7:
    1. On the C7 machine, create a systemd-nspawn container.  This
container will "run" C6.
    2. You can source everything you need from the running C6 system
directly.  Heck, if you have enough disk space on the C7 system, you
could just replicate the whole C6 tree to a sub-directory on C7.
    3. When you configure the C6 nspawn container, make sure you pass
through the directory structure with these FTP'ed files.  Basically
you are substituting systemd-nspawn's bind/filesystem pass-through
mechanism in place of NFS.
With that setup, you can "probably" run all the C6 native stuff under
C7.  This isn't guaranteed to work, e.g. if your C6 programs require
hooks into the kernel, it could fail, because now you're running on a
different kernel... but if you only use userspace libraries, you'll
probably be OK.  But I was actually able to get HandBrake, compiled
for bleeding-edge Ubuntu, to work within a C7 nspawn container.
That probably trades one bit of complexity (NFS) for another
(systemd-nspawn).  But just throwing it out there if you're completely
stuck.
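
To make the three steps above a little more concrete, here is a rough
sketch of how the container invocation could be assembled from Python.
It is only an illustration: the helper name nspawn_command and every
path in the usage note are hypothetical, Python 3 is assumed, and it
relies only on systemd-nspawn's --directory and --bind options.
Actually running it needs root on the C7 box.

```python
import subprocess


def nspawn_command(root_dir, bind_src, bind_dst, argv):
    """Build a systemd-nspawn command line that runs argv inside root_dir.

    root_dir is the copied C6 tree; bind_src on the host is made visible
    as bind_dst inside the container, taking the place of the NFS mount.
    """
    return [
        "systemd-nspawn",
        "--directory={}".format(root_dir),
        "--bind={}:{}".format(bind_src, bind_dst),
    ] + list(argv)


def run_in_container(root_dir, bind_src, bind_dst, argv):
    """Run argv in the container and return its exit status."""
    return subprocess.call(nspawn_command(root_dir, bind_src, bind_dst, argv))
```

For example, nspawn_command("/srv/c6-root", "/srv/ftp/incoming",
"/mnt/nfs-server", ["/usr/bin/python", "/opt/app/mover.py"]) would keep
the script's existing /mnt/nfs-server paths valid inside the container,
assuming those invented locations match the real layout.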

On Mon, Oct 24, 2016 at 5:25 PM, Matt Garman <matthew.garman at gmail.com> wrote:
> On Mon, Oct 24, 2016 at 2:42 PM, Larry Martell <larry.martell at gmail.com> wrote:
>>> At any rate, what I was looking at was seeing if there was any way to
>>> simplify this process, and cut NFS out of the picture. If you need
>>> only to push these files around, what about rsync?
>>
>> It's not just moving files around. The files are read, and their
>> contents are loaded into a MySQL database.
>
> On what server does the MySQL database live?

The C6 host, same one that the script runs on. We can of course access
the MySQL server from the C7 host, assuming the needed packages are
there.

>> This site is not in any way connected to the internet, and you cannot
>> bring in any computers, phones, or media of any kind. There is a
>> process to get machines or files in, but it is onerous and time
>> consuming. This system was set up and configured off site and then
>> brought on site.
>
> But clearly you have a means to log in to both the C6 and C7 servers,
> right? Otherwise, how would you be able to see these errors, check
> top/sar/free/iostat/etc?
>
> And if you are logging in to both of these boxes, I assume you are
> doing so via ssh?
>
> Or are you actually physically sitting in front of these machines?

The machines are on a local network. I access them with putty from a
windows machine, but I have to be at the site to do that.

> If you have ssh access to these machines, then you can trivially copy
> files to/from them. If ssh is installed and working, then scp should
> also be installed and working. Even if you don't have scp, you can
> use tar over ssh to the same effect. It's ugly, but doable, and there
> are examples online for how to do it.
>
> Also: you made a couple comments about these machines, it looks like
> the C7 box (FTP server + NFS server) is running bare metal (i.e. not a
> virtual machine). The C6 instance (NFS client) is virtualized.

Correct.

> What host is the C6 instance?
>
> Is the C6 instance running under the C7 instance? I.e., are both
> machines on the same physical hardware? If that is true, then your
> "network" (at least the one between C7 and C6) is basically virtual,
> and to have issues like this on the same physical box is certainly
> indicative of a mis-configuration.

Yes, the C6 instance is running on the C7 machine. What could be
mis-configured? What would I check to find out?

>> To run the script on the C7 NFS server instead of the C6 NFS client
>> many python libs will have to be installed. I do have someone off site
>> working on setting up a local yum repo with what I need, and then we
>> are going to see if we can zip and email the repo and get it on site.
>> But none of us are sys admins and we don't really know what we're
>> doing so we may not succeed and it may take longer than I will be here
>> in Japan (I am scheduled to leave Saturday).
>
> Right, but my point is you can write your own custom script(s) to copy
> files from C7 to C6 (based on rsync or ssh), do the processing on C6
> (DB loading, whatever other processing), then move back to C7 if
> necessary. You said yourself you are a programmer not a sysadmin, so
> change the nature of the problem from a sysadmin problem to a
> programming problem.

Yes, that is a potential solution I had not thought of. The issue with
this is that we have the same system installed at many, many sites,
and they all work fine. It is only this site that is having an issue.
We really do not want to have different SW running at just this one
site. Running the script on the C7 host is a change, but at least it
will be the same software as every place else.

> I'm certain I'm missing something, but the fundamental architecture
> doesn't make sense to me given what I understand of the process flow.
>
> Were you able to run some basic network testing tools between the C6
> and C7 machines? I'm interested specifically in netperf, which does
> round trip packet testing, both TCP and UDP. I would look for packet
> drops with UDP, and/or major performance outliers with TCP, and/or any
> kind of timeouts with either protocol.

netperf is not installed.

> How is name resolution working on both machines? Do you address
> machines by hostname (e.g., "my_c6_server"), or explicitly by IP
> address? Are you using DNS or are the IPs hard-coded in /etc/hosts?

Everything is by ip address.

> To me it still "smells" like a networking issue...
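
With netperf unavailable and nothing installable on site, a very crude
stand-in for the TCP round-trip part of that test can be put together
from the Python standard library alone. The sketch below is not a
replacement for netperf and does not cover UDP; the function names are
made up for illustration, and it assumes Python 3 on both ends and a
port that is free and reachable between the two machines.

```python
import socket
import time


def serve_echo(listener):
    """Accept one connection on listener and echo bytes back until EOF."""
    conn, _ = listener.accept()
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)
    finally:
        conn.close()


def measure_round_trips(host, port, count=100, payload=b"x" * 64, timeout=5.0):
    """Return a list of round-trip times in seconds for count echo exchanges.

    A stalled exchange raises socket.timeout after timeout seconds.
    """
    times = []
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        # Disable Nagle so each small payload is sent immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.monotonic()
            sock.sendall(payload)
            received = 0
            while received < len(payload):
                chunk = sock.recv(len(payload) - received)
                if not chunk:
                    raise RuntimeError("connection closed by peer")
                received += len(chunk)
            times.append(time.monotonic() - start)
    finally:
        sock.close()
    return times
```

One side would bind and listen on a socket and pass it to serve_echo;
the other would call measure_round_trips with that machine's IP address
and port, then look at max(times) against the typical value for
outliers or timeouts.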