On Wed, May 22, 2019 at 10:02 AM mark <m.roth at 5-cent.us> wrote:
> That seems unlikely. For one, I've seen that... but I *always* see entries
> in the log about the oom-killer being invoked. For another, this isn't a
> compute node, it's *only* a fileserver, serving projects, home
> directories, and backups (home-grown b/u, uses rsync), and backups don't
> start until well after midnight, and as we're business-hours only, there
> was less usage, and it does have 256G RAM....
>
I have two servers that would lock up like this occasionally, and if I let
them sit at the console long enough sometimes they would give a login
prompt. It took a lot of time and frustration (these are prod servers) but
I tracked it down to a problem in the XFS driver, as it never occurred on
the systems with EXT4 filesystems. The XFS driver would hang, preventing
writes to the filesystem. I could identify exactly when that happened as
all system logging would suddenly stop at the same second. Then the OOM
killer would come in and start killing off processes, but that wouldn't be
in the logs on disk because the filesystem couldn't be written to. I rolled
the servers back to a 5xx series kernel and the issue didn't resurface. I
recently let them boot the newer 9xx series kernels and I'm hoping the XFS
issue is fixed.
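
For anyone who wants to check their own logs for the "everything stops at
the same second" symptom, here is a rough sketch of the idea in Python. It
is only an illustration: the function names, the one-second tolerance, and
the assumption that the logs use the traditional syslog timestamp
("May 22 03:14:07") are all mine, and the year has to be supplied because
that format does not record it.

```python
from datetime import datetime, timedelta

# Assumed format: traditional syslog lines such as
# "May 22 03:14:07 host daemon: message". The year is passed in separately.
SYSLOG_FORMAT = "%Y %b %d %H:%M:%S"


def last_timestamp(lines, year):
    """Return the timestamp of the last parseable line, or None if none parse."""
    last = None
    for line in lines:
        stamp = line[:15]  # the "Mon DD HH:MM:SS" prefix
        try:
            last = datetime.strptime(f"{year} {stamp}", SYSLOG_FORMAT)
        except ValueError:
            continue  # skip continuation lines and anything unparseable
    return last


def stopped_together(last_by_log, tolerance=timedelta(seconds=1)):
    """True if every log's final entry falls within `tolerance` of the others."""
    stamps = [t for t in last_by_log.values() if t is not None]
    return len(stamps) > 1 and max(stamps) - min(stamps) <= tolerance
```

To use it, read the lines of each log file written before the hang (for
example with open(path, errors="replace")), pass them to last_timestamp(),
and hand the resulting dict to stopped_together(). A True result is only a
hint that writes stalled; a log that is normally quiet will also show an
old final entry.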