Hi All,

I have an older CentOS 7.4 system that is used for computationally heavy work. It has a 32G root filesystem, of which 33% is consumed. Lately, one particular set of jobs (run through the SGE batch scheduler) seems to cause a peculiar condition in which the root filesystem space is exhausted, but I can't find any files that would identify which process is causing the problem. I'm guessing that the problem is triggered by a certain set of jobs, since it only occurs when those jobs are running.

I would like to identify the process that is causing this condition. My standard approach is to identify which files are using the otherwise empty space and then identify which process is using those files. It's not working: I haven't been able to identify the files. They aren't there. All attempts to measure disk usage of / by files show that the disk usage is only a fraction of the available space and that there should be space free.

I know that space can be consumed by deleted files for which file handles are still open, not visible to ls but detectable using techniques like:

    lsof | grep deleted

and, for sizing:

    find /proc/*/fd -ls 2>/dev/null | grep '(deleted)' | \
        sed 's!.*\(/proc[^ ]*\).*!\1!' | xargs wc -c | sort -nr

Applying this technique during one of these episodes shows nothing of interest. I'm definitely missing something. Are there any other techniques for identifying other "invisible" files that I can look for?

Thanks in advance for any advice.
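A variant of the same technique, in case it's useful: lsof's +L1 option restricts the listing to files with a link count below one, i.e. deleted but still held open, so the sizing can be done in one pass. This is a sketch; the size column position ($7, SIZE/OFF) assumes stock lsof output, so adjust the field number if your version's columns differ:

```shell
# List deleted-but-still-open files (+L1 = link count < 1) and sum
# their sizes; field 7 is the SIZE/OFF column in stock lsof output.
lsof -nP +L1 2>/dev/null |
awk 'NR > 1 { sum += $7 }
     END { printf "%.1f MiB held by deleted-but-open files\n", sum / 1048576 }'
```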
centos2 at foxengines.net wrote:

> It's not working, I haven't been able to identify the files. They
> aren't there. All attempts to measure disk usage of / by files shows
> that the disk usage is only a percentage of available space and that
> there should be space available.

Sparse files? How are you determining how much free space you have?

-- 
Yves Bellefeuille <yan at storm.ca>
On Thu, Oct 08, 2020 at 11:12:54AM -0400, Yves Bellefeuille wrote:

> centos2 at foxengines.net wrote:
>
> > It's not working, I haven't been able to identify the files. They
> > aren't there. All attempts to measure disk usage of / by files shows
> > that the disk usage is only a percentage of available space and that
> > there should be space available.
>
> Sparse files? How are you determining how much free space you have?

Thanks for your response. I didn't attempt to find sparse files specifically, but there were no files (or dot-files) at the top level of / that contained any significant data. The sum of the sizes of all of the directories at the top level of /, as reported by du, did not match the amount of disk space used at the time of the problem. I don't have a transcript of that session, but I was using commands like:

    find / -maxdepth 1 -xdev -type d | while read d; do du -shx "$d"; done

I poked around in /var and /tmp a lot but didn't find anything that would contradict the output of the previous command. At this point I started searching for deleted files for which the space had not been reclaimed. Finding nothing, I thought there was something I hadn't run into before and didn't know what to look for.

I'm not confident I understand your meaning in the second sentence. I didn't try to determine how much free space I had because there wasn't any. The root filesystem was at 100% capacity and services were failing. I was just trying to find out what had taken it all, since normal usage is around 33% or so, according to df.

Rebooting the computer eliminates the problem. When it comes back up, the disk usage is again at 33%. Whatever it is vanishes during a reboot.
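In case it helps anyone hitting the same thing: the mismatch I was eyeballing can be computed directly, and GNU find can rule sparse files in or out at the same time. A sketch, assuming GNU coreutils recent enough for df --output (CentOS 7's is) and GNU find; the 1 MiB slack in the sparse test is an arbitrary cutoff:

```shell
#!/bin/sh
# How much of df's "used" figure can du actually account for on /?
# A large gap means something du cannot see is holding the space.
used_df=$(df -B1 --output=used / | tail -n 1 | tr -d ' ')
used_du=$(du -sxB1 / 2>/dev/null | awk '{ print $1 }')
echo "df reports used: $used_df bytes"
echo "du accounts for: $used_du bytes"
echo "unaccounted:     $((used_df - used_du)) bytes"

# Rule sparse files in or out: an apparent size (%s, bytes) much larger
# than the allocated space (%b, 512-byte blocks) marks a file as sparse.
find / -xdev -type f -printf '%s %b %p\n' 2>/dev/null |
awk '$1 > $2 * 512 + 1048576 { print "sparse:", $3 }'
```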
Does the filesystem have a fixed number of inodes? Perhaps the problem is the number of files, not their sizes.

Does the filesystem have an explicit free list? If so, I'd expect there to be tools that could tell you how much was on it.

-- 
Michael
hennebry at web.cs.ndsu.NoDak.edu

"Sorry but your password must contain an uppercase letter, a number,
a haiku, a gang sign, a heiroglyph, and the blood of a virgin."
                                                  -- someeecards
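The inode angle can be checked directly. A sketch; /dev/sda1 is a placeholder for whatever device / actually lives on:

```shell
# IUse% at 100% while blocks remain free would mean the filesystem
# ran out of inodes, not bytes.
df -i /

# The ext superblock's own free-block and free-inode counters;
# replace /dev/sda1 with the real root device (findmnt -n -o SOURCE /).
dumpe2fs -h /dev/sda1 2>/dev/null | grep -E '^Free (blocks|inodes)'
```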