Hi Matt-

Thank you for this very detailed and thoughtful reply.

On Fri, Oct 21, 2016 at 4:43 PM, Matt Garman <matthew.garman at gmail.com> wrote:
> On Fri, Oct 21, 2016 at 4:14 AM, Larry Martell <larry.martell at gmail.com> wrote:
>> We have 1 system running CentOS 7 that is the NFS server. There are 50
>> external machines that FTP files to this server fairly continuously.
>>
>> We have another system running CentOS 6 that mounts the partition the files
>> are FTP-ed to using NFS.
>>
>> There is a python script running on the NFS client machine that is reading
>> these files and moving them to a new dir on the same file system (a mv not
>> a cp).
>
> To be clear: the python script is moving files on the same NFS file
> system? E.g., something like
>
> mv /mnt/nfs-server/dir1/file /mnt/nfs-server/dir2/file
>
> where /mnt/nfs-server is the mount point of the NFS server on the
> client machine?

Correct.

> Or are you moving files from the CentOS 7 NFS server to the CentOS 6 NFS client?

No, the files are FTP-ed to the CentOS 7 NFS server and then processed
and moved on the CentOS 6 NFS client.

> If the former, i.e., you are moving files to and from the same system,
> is it possible to completely eliminate the C6 client system, and just
> set up a local script on the C7 server that does the file moves? That
> would cut out a lot of complexity, and also improve performance
> dramatically.

The problem with doing that is that the files are processed and loaded
into MySQL and then moved by a script that uses the Django ORM, and
neither Django, nor any of the other python packages needed, are
installed on the server. And since the server does not have an external
internet connection (as I mentioned in my reply to Mark), getting it
set up would require a large amount of effort.

Also, we have this exact same setup on over 10 other systems, and it
is only this one that is having a problem. The one difference with
this one is that the server is CentOS 7 - on all the other systems both
the NFS server and client are CentOS 6.

> Also, what is the size range of these files? Are they fairly small
> (e.g. 10s of MB or less), medium-ish (100s of MB) or large (>1GB)?

Small - they range in size from about 100K to 6M.

>> Almost daily this script hangs while reading a file - sometimes it never
>> comes back and cannot be killed, even with -9. Other times it hangs for 1/2
>> hour then proceeds on.
>
> Timeouts relating to NFS are the worst.
>
>> Coinciding with the hanging I see this message on the NFS server host:
>>
>> nfsd: peername failed (error 107)
>>
>> And on the NFS client host I see this:
>>
>> nfs: V4 server returned a bad sequence-id
>> nfs state manager - check lease failed on NFSv4 server with error 5
>
> I've been wrangling with NFS for years, but unfortunately those
> particular messages don't ring a bell.
>
> The first thing that came to my mind is: how does the Python script
> running on the C6 client know that the FTP upload to the C7 server is
> complete? In other words, if someone is uploading "fileA", and the
> Python script starts to move "fileA" before the upload is complete,
> then at best you're setting yourself up for all kinds of confusion,
> and at worst file truncation and/or corruption.

The python script checks the modification time of the file, and only
if it has not been modified in more than 2 minutes does it process it.
Otherwise it skips it and waits for the next run to potentially
process it. Also, the script can tell if the file is incomplete in a
few different ways. So if it has not been modified in more than 2
minutes, the script starts to process it, but if it finds that it's
incomplete it aborts the processing and leaves it for next time.
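
For what it's worth, the gate is essentially this (a simplified sketch,
not the actual script - the names are made up and the incomplete-file
checks are elided):

import os
import time

QUIET_SECONDS = 120  # "not modified in more than 2 minutes"

def ready_to_process(path):
    # True if the file looks settled, i.e. hasn't changed recently
    try:
        age = time.time() - os.stat(path).st_mtime
    except OSError:
        return False  # file vanished between listing and stat
    return age > QUIET_SECONDS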
> Making a pure guess about those particular errors: is there any chance
> there is a network issue between the C7 server and the C6 client?
> What is the connection between those two servers? Are they physically
> adjacent to each other and on the same subnet? Or are they on
> opposite ends of the globe connected through the Internet?

Actually both the client and server are virtual machines running on
one physical machine. The physical machine is running CentOS 7. There
is nothing else running on the physical machine other than the 2 VMs.

> Clearly two machines on the same subnet, separated only by one switch
> is the simplest case (i.e. the kind of simple LAN one might have in
> his home). But once you start crossing subnets, then routing configs
> come into play. And maybe you're using hostnames rather than IP
> addresses directly, so then name resolution comes into play (DNS or
> /etc/hosts). And each switch hop you add requires that not only your
> server network config needs to be correct, but also your switch config
> needs to be correct as well. And if you're going over the Internet,
> well... I'd probably try really hard to not use NFS in that case! :)
>
> Do you know if your NFS mount is using TCP or UDP? On the client you
> can do something like this:
>
> grep nfs /proc/mounts | less -S
>
> And then look at what the "proto=XXX" says. I expect it will be
> either "tcp" or "udp". If it's UDP, modify your /etc/fstab so that
> the options for that mountpoint include "proto=tcp". I *think* the
> default is now TCP, so this may be a non-starter. But the point is,
> based purely on the conjecture that you might have an unreliable
> network, TCP would be a better fit.

I assume TCP, but I will check tomorrow when I am on site.
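
In the meantime, something like this should pull the proto= option for
every NFS mount out of /proc/mounts (a quick sketch, not yet tried on
that box):

def nfs_mount_protos(mounts_path="/proc/mounts"):
    # map each NFS mount point to its proto= option (tcp or udp)
    protos = {}
    with open(mounts_path) as f:
        for line in f:
            fields = line.split()
            fstype, options = fields[2], fields[3]
            if fstype in ("nfs", "nfs4"):
                opts = dict(o.partition("=")[::2] for o in options.split(","))
                protos[fields[1]] = opts.get("proto", "unknown")
    return protos

print(nfs_mount_protos())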
> I hate to simply say "RTFM", but NFS is complex, and I still go back
> and re-read the NFS man page ("man nfs"). This document is long and
> very dense, but it's worth at least being familiar with its content.

Yes, I agree. I skimmed it last week, but I will look at it in detail tomorrow.

>> The first client message is always at the same time as the hanging starts.
>> The second client message comes 20 minutes later.
>> The server message comes 4 minutes after that.
>> Then 3 minutes later the script un-hangs (if it's going to).
>
> In my experience, delays that happen on consistent time intervals that
> are on the order of minutes tend to smell of some kind of timeout
> scenario. So the question is, what triggers the timeout state?
>
>> Can anyone shed any light on what could be happening here and/or what I
>> could do to alleviate these issues and stop the script from hanging?
>> Perhaps some NFS config settings? We do not have any, so we are using the
>> defaults.
>
> My general rule of thumb is "defaults are generally good enough; make
> changes only if you understand their implications and you know you
> need them (or temporarily as a diagnostic tool)".

I would like to try increasing the timeout.

> But anyway, my hunch is that there might be a network issue. So I'd
> actually start with basic network troubleshooting. Do an "ifconfig"
> on both machines: do you see any drops or interface errors? Do
> "ethtool <interface>" on both machines to make sure both are linked up
> at the correct speed and duplex. Use a tool like netperf to check
> bandwidth between both hosts. Look at the actual detailed stats, do
> you see huge outliers or timeouts? Do the test with both TCP and UDP;
> performance should be similar, with a (typically) slight gain with UDP.
> Do you see drops with UDP?
>
> What's the state of the hardware? Are they ancient machines cobbled
> together from spare parts, or reasonably decent machines? Do they
> have adequate cooling and power? Is there any chance they are
> overheating (even briefly) or possibly being fed unclean power (e.g.
> small voltage aberrations)?

The hardware is new, and is in a rack in a server room with adequate
and monitored cooling and power. But I just found out from someone on
site that there is a disk failure, which happened back on Sept 3. The
system uses RAID, but I don't know what level. I was told it can
tolerate 3 disk failures and still keep working, but personally, I
think all bets are off until the disk has been replaced. That should
happen in the next day or 2, so we shall see.

> Oh, also, look at the load on the two machines... are these
> purpose-built servers, or are they used for numerous other tasks?
> Perhaps one or both is overloaded. top is the tool we use
> instinctively, but also take a look at vmstat and iostat. Oh, also
> check "free", make sure neither machine is swapping (thrashing).

I've been watching and monitoring the machines for 2 days and neither
one has had a large CPU load, nor has been using much memory.

> If you're not already doing this, I would recommend setting up "sar"
> (from the package "sysstat") and setting up more granular logging than
> the default. sar is kind of like a continuous
> iostat+free+top+vmstat+other system load tools rolled into one that
> continually writes this information to a database. So for example,
> next time this thing happens, you can look at the sar logs to see if
> any particular metric went significantly out-of-whack.

That is a good idea, I will check those logs, and set up better logging.

> That's all I can think of for now. Best of luck. You have my
> sympathy... I've been administering Linux both as a hobby and
> professionally for longer than I care to admit, and NFS still scares
> me. Just be thankful you're not using Kerberized NFS. ;)

Thanks!
Larry

On Sun, Oct 23, 2016 at 9:02 AM, Larry Martell <larry.martell at gmail.com> wrote:
> [snip]
>
>> Making a pure guess about those particular errors: is there any chance
>> there is a network issue between the C7 server and the C6 client?
>> What is the connection between those two servers?
>
> Actually both the client and server are virtual machines running on
> one physical machine. The physical machine is running CentOS 7. There
> is nothing else running on the physical machine other than the 2 VMs.

I misspoke here - the CentOS 7 NFS server is running on the physical
hardware, it's not a VM. The CentOS 6 client is a VM.

On Sun, Oct 23, 2016 at 9:02 AM, Larry Martell <larry.martell at gmail.com> wrote:
> [snip]
>
>> Do you know if your NFS mount is using TCP or UDP? On the client you
>> can do something like this:
>>
>> grep nfs /proc/mounts | less -S
>>
>> And then look at what the "proto=XXX" says.

On the client the settings are:

nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.2.200,minorversion=0,local_lock=none,addr=192.168.2.102 0 0

On the server the settings are:

xfs rw,relatime,inode64,noquota 0 0

Do these look OK?

>> But anyway, my hunch is that there might be a network issue. So I'd
>> actually start with basic network troubleshooting. Do an "ifconfig"
>> on both machines: do you see any drops or interface errors?

None on the client. On the server it has 1 dropped Rx packet.

>> Do "ethtool <interface>" on both machines to make sure both are linked up
>> at the correct speed and duplex.

That reports only "Link detected: yes" for both client and server.

>> If you're not already doing this, I would recommend setting up "sar"
>> (from the package "sysstat") and setting up more granular logging than
>> the default.

sar seems to be running, but I can only get it to report on the
current day. The man page shows start and end time options, but is
there a way to specify the start and end date?

On Sun, Oct 23, 2016 at 8:02 AM, Larry Martell <larry.martell at gmail.com> wrote:
>> To be clear: the python script is moving files on the same NFS file
>> system? E.g., something like
>>
>> mv /mnt/nfs-server/dir1/file /mnt/nfs-server/dir2/file
>>
>> where /mnt/nfs-server is the mount point of the NFS server on the
>> client machine?
>
> Correct.
>
>> Or are you moving files from the CentOS 7 NFS server to the CentOS 6 NFS client?
>
> No, the files are FTP-ed to the CentOS 7 NFS server and then processed
> and moved on the CentOS 6 NFS client.

I apologize if I'm being dense here, but I'm more confused on this
data flow now. Your use of "correct" and "no" seems to be
inconsistent with your explanation. Sorry!

At any rate, what I was looking at was seeing if there was any way to
simplify this process, and cut NFS out of the picture. If you need
only to push these files around, what about rsync?

> The problem with doing that is that the files are processed and loaded
> into MySQL and then moved by a script that uses the Django ORM, and
> neither Django, nor any of the other python packages needed, are
> installed on the server. And since the server does not have an external
> internet connection (as I mentioned in my reply to Mark), getting it
> set up would require a large amount of effort.

...right, but I'm pretty sure rsync should be installed on the server;
I believe it's default in all except the "minimal" setup profiles.
Either way, it's trivial to install, as I don't think it has any
dependencies. You can download the rsync rpm from mirror.centos.org,
then scp it to the server, then install via yum. And Python is
definitely installed (requirement for yum) and Perl is probably
installed as well, so with rsync plus some basic Perl/Python scripting
you can create your own mover script.

Actually, rsync may not even be necessary, scp may be sufficient for
your purposes. And scp should definitely be installed.

> Also, we have this exact same setup on over 10 other systems, and it
> is only this one that is having a problem. The one difference with
> this one is that the server is CentOS 7 - on all the other systems both
> the NFS server and client are CentOS 6.

From what you've described so far, with what appears to be a
relatively simple config, C6 or C7 "shouldn't" matter. However, under
the hood, C6 and C7 are quite different.

> The python script checks the modification time of the file, and only
> if it has not been modified in more than 2 minutes does it process it.
> Otherwise it skips it and waits for the next run to potentially
> process it. Also, the script can tell if the file is incomplete in a
> few different ways. So if it has not been modified in more than 2
> minutes, the script starts to process it, but if it finds that it's
> incomplete it aborts the processing and leaves it for next time.

This script runs on C7 or C6?

> The hardware is new, and is in a rack in a server room with adequate
> and monitored cooling and power. But I just found out from someone on
> site that there is a disk failure, which happened back on Sept 3. The
> system uses RAID, but I don't know what level. I was told it can
> tolerate 3 disk failures and still keep working, but personally, I
> think all bets are off until the disk has been replaced. That should
> happen in the next day or 2, so we shall see.

OK, depending on the RAID scheme and how it's implemented, there could
be disk timeouts causing things to hang.

> I've been watching and monitoring the machines for 2 days and neither
> one has had a large CPU load, nor has been using much memory.

How about iostat? Also, good old "dmesg" can suggest if the system
with the failed drive is causing timeouts to occur.

> None on the client. On the server it has 1 dropped Rx packet.
>
>>> Do "ethtool <interface>" on both machines to make sure both are linked up
>>> at the correct speed and duplex.
>
> That reports only "Link detected: yes" for both client and server.

OK, but ethtool should also say something like:

...
Speed: 1000Mb/s
Duplex: Full
...

for a 1gbps network. If Duplex is reported as "half", then that is
definitely a problem. Using netperf is further confirmation of
whether or not your network is functioning as expected.
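
If ethtool is being that terse, you can also read the same info
straight out of sysfs, e.g. with a quick Python sketch like this (the
interface name is a guess, and some virtual NICs don't fill these
files in at all - which may itself be a clue on the VM):

def link_info(iface):
    # speed is in Mb/s, duplex is "full" or "half"; reading these can
    # fail with EINVAL if the driver doesn't expose them
    base = "/sys/class/net/%s/" % iface
    speed = open(base + "speed").read().strip()
    duplex = open(base + "duplex").read().strip()
    return speed, duplex

print(link_info("eth0"))  # e.g. ('1000', 'full')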
> sar seems to be running, but I can only get it to report on the
> current day. The man page shows start and end time options, but is
> there a way to specify the start and end date?

If you want to report on a day in the past, you have to pass the file
argument, something like this:

sar -A -f /var/log/sa/sa23 -s 07:00:00 -e 08:00:00

That would show you yesterday's data between 7am and 8am. The files
in /var/log/sa/saXX are the files that correspond to the day. By
default, XX will be the day of the month.
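
If you end up pulling these reports a lot, it's trivial to wrap, e.g.
a throwaway sketch (assumes the default /var/log/sa layout and that
sysstat's collector has been running):

import subprocess

def sar_for_day(day_of_month, start="07:00:00", end="08:00:00"):
    # sysstat keeps one binary file per day of the month under /var/log/sa/
    safile = "/var/log/sa/sa%02d" % day_of_month
    return subprocess.call(
        ["sar", "-A", "-f", safile, "-s", start, "-e", end])

sar_for_day(23)  # the 23rd's data between 7am and 8am, as above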

On Mon, Oct 24, 2016 at 1:32 PM, Matt Garman <matthew.garman at gmail.com> wrote:
> I apologize if I'm being dense here, but I'm more confused on this
> data flow now. Your use of "correct" and "no" seems to be
> inconsistent with your explanation. Sorry!

I thought you were asking "Are you doing: A: moving files on the same
NFS filesystem, or B: moving them across filesystems?" And I replied,
"Correct, I am doing A; no, I am not doing B." The script moves the
files from /mnt/nfs-server/dir1/file to /mnt/nfs-server/dir2/file.

> At any rate, what I was looking at was seeing if there was any way to
> simplify this process, and cut NFS out of the picture. If you need
> only to push these files around, what about rsync?

It's not just moving files around. The files are read, and their
contents are loaded into a MySQL database.
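
Roughly, the per-file logic is (a heavily simplified sketch -
parse_file here is a stand-in for the real parser, and the actual
loading goes through the Django ORM rather than raw MySQL):

import os

SRC = "/mnt/nfs-server/dir1"
DST = "/mnt/nfs-server/dir2"

def parse_file(path):
    # stand-in for the real parser: return None if the file looks
    # incomplete, otherwise the records to load
    with open(path) as f:
        data = f.read()
    return data.splitlines() or None

def handle(filename):
    src = os.path.join(SRC, filename)
    records = parse_file(src)
    if records is None:
        return  # incomplete: skip it and leave it for the next run
    # ... load records into MySQL here (really done via the Django ORM) ...
    # mv, not cp: same filesystem, so this is a single rename on the
    # NFS server, not a data copy through the client
    os.rename(src, os.path.join(DST, filename))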
>> The problem with doing that is that the files are processed and loaded
>> into MySQL and then moved by a script that uses the Django ORM, and
>> neither Django, nor any of the other python packages needed, are
>> installed on the server. And since the server does not have an external
>> internet connection (as I mentioned in my reply to Mark), getting it
>> set up would require a large amount of effort.
>
> ...right, but I'm pretty sure rsync should be installed on the server;
> I believe it's default in all except the "minimal" setup profiles.
> Either way, it's trivial to install, as I don't think it has any
> dependencies. You can download the rsync rpm from mirror.centos.org,
> then scp it to the server, then install via yum. And Python is
> definitely installed (requirement for yum) and Perl is probably
> installed as well, so with rsync plus some basic Perl/Python scripting
> you can create your own mover script.
>
> Actually, rsync may not even be necessary, scp may be sufficient for
> your purposes. And scp should definitely be installed.

This site is not in any way connected to the internet, and you cannot
bring in any computers, phones, or media of any kind. There is a
process to get machines or files in, but it is onerous and time
consuming. This system was set up and configured off site and then
brought on site.

To run the script on the C7 NFS server instead of the C6 NFS client,
many python libs would have to be installed. I do have someone off site
working on setting up a local yum repo with what I need, and then we
are going to see if we can zip and email the repo and get it on site.
But none of us are sys admins and we don't really know what we're
doing, so we may not succeed, and it may take longer than I will be
here in Japan (I am scheduled to leave Saturday).

>> Also, we have this exact same setup on over 10 other systems, and it
>> is only this one that is having a problem. The one difference with
>> this one is that the server is CentOS 7 - on all the other systems both
>> the NFS server and client are CentOS 6.
>
> From what you've described so far, with what appears to be a
> relatively simple config, C6 or C7 "shouldn't" matter. However, under
> the hood, C6 and C7 are quite different.
>
>> The python script checks the modification time of the file, and only
>> if it has not been modified in more than 2 minutes does it process it.
>> [snip]
>
> This script runs on C7 or C6?

C6

>> The hardware is new, and is in a rack in a server room with adequate
>> and monitored cooling and power. But I just found out from someone on
>> site that there is a disk failure, which happened back on Sept 3.
>> [snip]
>
> OK, depending on the RAID scheme and how it's implemented, there could
> be disk timeouts causing things to hang.

Yes, that's why when I found out about the disk failure I wanted to
hold off doing anything until the disk gets replaced. But as that is
not happening until Wednesday afternoon, I think I want to try Mark's
nobarrier config option today.

>> I've been watching and monitoring the machines for 2 days and neither
>> one has had a large CPU load, nor has been using much memory.
>
> How about iostat? Also, good old "dmesg" can suggest if the system
> with the failed drive is causing timeouts to occur.

Nothing in dmesg or /var/log/messages about the failed disk at all. I
only saw that when I got on the Integrated Management Module console.
But the logs only go back to Sep 21 and the disk failed on Sep 3. The
logs only have the NFS errors, no other errors.

>> That reports only "Link detected: yes" for both client and server.
>
> OK, but ethtool should also say something like:
>
> ...
> Speed: 1000Mb/s
> Duplex: Full
> ...

No, it outputs just the one line:

Link detected: yes

> For a 1gbps network. If Duplex is reported as "half", then that is
> definitely a problem. Using netperf is further confirmation of
> whether or not your network is functioning as expected.
>
>> sar seems to be running, but I can only get it to report on the
>> current day. The man page shows start and end time options, but is
>> there a way to specify the start and end date?
>
> If you want to report on a day in the past, you have to pass the file
> argument, something like this:
>
> sar -A -f /var/log/sa/sa23 -s 07:00:00 -e 08:00:00
>
> That would show you yesterday's data between 7am and 8am. The files
> in /var/log/sa/saXX are the files that correspond to the day. By
> default, XX will be the day of the month.

OK, Thanks.