Is anyone on the list using kerberized-nfs on any kind of scale?

I've been fighting with this for years. In general, when we have issues with this system, they are random and/or not repeatable. I've had very little luck with community support. I hope I don't offend by saying that! Rather, my belief is that these problems are very niche/esoteric, and so beyond the scope of typical community support. But I'd be delighted to be proven wrong!

So this is more of a "meta" question: does anyone out there have any general recommendations for how to get support on what I presume are niche problems specific to our environment? How is paid upstream support?

Just to give a little insight into our issues: we have an in-house-developed compute job dispatching system. Say a user has 100s of analysis jobs he wants to run: he submits them to a central master process, which in turn dispatches them to a "farm" of >100 compute nodes. All these nodes have two different krb5p NFS mounts, to which the jobs read and write. So while the users can technically log in directly to the compute nodes, in practice they never do. The logins are only "implicit", when the job dispatching system does a behind-the-scenes ssh to kick off these processes.

Just to give some "flavor" to the kinds of issues we're facing, what tends to crop up is one of three things:

(1) Random crashes. These are full-on kernel trace dumps followed by an automatic reboot. This was really bad under CentOS 5. A random kernel upgrade magically fixed it. It happens almost never under CentOS 6, but happens fairly frequently under CentOS 7. (We're completely off CentOS 5 now, BTW.)

(2) Permission denied issues. I have user Kerberos tickets configured for 70 days. But there is clearly some kind of undocumented kernel caching going on. Looking at the Kerberos server logs, it looks like it "could" be a performance issue, as I see 100s of ticket requests within the same second when someone tries to launch a lot of jobs. Many of these will fail with "permission denied", but if they immediately re-try, it works. Related to this, I have been unable to figure out what creates and deletes the /tmp/krb5cc_uid_random files.

(3) Kerberized NFS shares getting "stuck" for one or more users. We have another monitoring app (in-house developed) that, among other things, makes periodic checks of these NFS mounts. It does so by forking and doing a simple "ls" command (sketched in the P.S. below). This is to ensure that these mounts are alive and well. Sometimes the "ls" command gets stuck to the point where it can't even be killed via "kill -9"; only a reboot fixes it. But the mount is only stuck for the user running the monitoring app. Or sometimes the monitoring app is fine, but an actual user's processes will get stuck in "D" state (in top; means waiting on I/O), while everyone else's jobs (and access to the kerberized NFS shares) are OK.

This is actually blocking us from upgrading to CentOS 7. But my colleagues and I are at a loss how to solve this. So this post is really more of a semi-desperate plea for any kind of advice. What other resources might we consider? Paid support is not out of the question (within reason). Are there any "super specialist" consultants out there who deal in Kerberized NFS?

Thanks!
Matt
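P.S. For concreteness, each monitoring check is essentially just a forked "ls" against the mount, something like the sketch below; the mount point name is made up for illustration. Note that even a timeout guard wouldn't save us here: SIGKILL can't interrupt a task in uninterruptible ("D") sleep, so a watchdog just hangs alongside it, which matches the "kill -9" behavior above.

    # Hypothetical stand-in for the in-house check; /mnt/secure is illustrative.
    # When the mount wedges, ls enters D state and even SIGKILL is ignored.
    timeout -s KILL 15 ls /mnt/secure >/dev/null 2>&1 \
        || logger -t nfs-monitor "krb5p mount check failed or hung: /mnt/secure"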
m.roth at 5-cent.us
2017-Mar-22 20:19 UTC
[CentOS] kerberized-nfs - any experts out there?
Matt Garman wrote:
> Is anyone on the list using kerberized-nfs on any kind of scale?

We use it here. I don't think I'm an expert - my manager is - but let me think about your issues.
<snip>

> Just to give a little insight into our issues: we have an
> in-house-developed compute job dispatching system. Say a user has
> 100s of analysis jobs he wants to run: he submits them to a central
> master process, which in turn dispatches them to a "farm" of >100
> compute nodes. All these nodes have two different krb5p NFS mounts,
> to which the jobs read and write. So while the users can technically
> log in directly to the compute nodes, in practice they never do. The
> logins are only "implicit", when the job dispatching system does a
> behind-the-scenes ssh to kick off these processes.

I would strongly recommend that you look into Slurm. It's being used here at both large and small scale, and is explicitly for that purpose.

> (1) Random crashes. These are full-on kernel trace dumps followed
> by an automatic reboot. This was really bad under CentOS 5. A random
> kernel upgrade magically fixed it. It happens almost never under
> CentOS 6, but happens fairly frequently under CentOS 7. (We're
> completely off CentOS 5 now, BTW.)

This may be a separate issue entirely.

> (2) Permission denied issues. I have user Kerberos tickets
> configured for 70 days. But there is clearly some kind of
> undocumented kernel caching going on. Looking at the Kerberos server
> logs, it looks like it "could" be a performance issue, as I see 100s
> of ticket requests within the same second when someone tries to
> launch a lot of jobs. Many of these will fail with "permission
> denied", but if they immediately re-try, it works. Related to this,
> I have been unable to figure out what creates and deletes the
> /tmp/krb5cc_uid_random files.

Are they asking for *new* credentials each time? They should only be doing one kinit.

> (3) Kerberized NFS shares getting "stuck" for one or more users.
> We have another monitoring app (in-house developed) that, among
> other things, makes periodic checks of these NFS mounts. It does so
> by forking and doing a simple "ls" command. This is to ensure that
> these mounts are alive and well. Sometimes the "ls" command gets
> stuck to the point where it can't even be killed via "kill -9"; only
> a reboot fixes it. But the mount is only stuck for the user running
> the monitoring app. Or sometimes the monitoring app is fine, but an
> actual user's processes will get stuck in "D" state (in top; means
> waiting on I/O), while everyone else's jobs (and access to the
> kerberized NFS shares) are OK.

And there's nothing in the logs, correct? Have you tried attaching strace to one of those to see if you can get a clue as to what's happening?
<snip>

mark
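P.S. For concreteness, something along these lines is what I had in mind; 12345 stands in for whatever PID the stuck ls shows up as in ps:

    # Find tasks stuck in uninterruptible sleep and what they're blocked on
    ps -eo pid,stat,wchan:30,args | awk '$2 ~ /D/'
    # Attach to the stuck ls; -f follows forks, -tt timestamps each call.
    # If strace prints nothing at all, the task is wedged inside the
    # kernel, which is itself a clue.
    strace -f -tt -p 12345
    # Kernel-side stack of the stuck task (needs root)
    cat /proc/12345/stack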
Feel free to contact me offline if you wish. I'll just go on record as saying that it's a bear.

----- On 22 Mar, 2017, at 12:26, Matt Garman <matthew.garman at gmail.com> wrote:

| Is anyone on the list using kerberized-nfs on any kind of scale?
| <snip>
--
James A. Peltier
IT Services - Research Computing Group
Simon Fraser University - Burnaby Campus
Phone   : 604-365-6432
Fax     : 778-782-3045
E-Mail  : jpeltier at sfu.ca
Website : http://www.sfu.ca/itservices
Twitter : @sfu_rcg
Powering Engagement Through Technology
On 03/22/2017 03:26 PM, Matt Garman wrote:
> Is anyone on the list using kerberized-nfs on any kind of scale?

Not for a good many years.

Are you using v3 or v4 NFS?

Also, you can probably stuff the rpc.gss* and idmapd services into verbose mode, which may give you a better idea as to what's going on.

And yes, the kernel does some Kerberos caching. I think 10 to 15 minutes.
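If memory serves, on EL7 that's roughly the following; each extra -v bumps the verbosity, and the exact unit names may vary by point release:

    # /etc/sysconfig/nfs -- read via the nfs-config service
    RPCGSSDARGS="-vvv"
    RPCIDMAPDARGS="-vvv"

    # then restart and watch the logs:
    systemctl restart nfs-config rpc-gssd
    journalctl -fu rpc-gssd

Or just run the daemon in the foreground while you reproduce the problem: rpc.gssd -f -vvv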
On Wed, Mar 22, 2017 at 3:19 PM, <m.roth at 5-cent.us> wrote:
> Matt Garman wrote:
>> (2) Permission denied issues. I have user Kerberos tickets
>> configured for 70 days. But there is clearly some kind of
>> undocumented kernel caching going on. Looking at the Kerberos
>> server logs, it looks like it "could" be a performance issue, as I
>> see 100s of ticket requests within the same second when someone
>> tries to launch a lot of jobs. Many of these will fail with
>> "permission denied", but if they immediately re-try, it works.
>> Related to this, I have been unable to figure out what creates and
>> deletes the /tmp/krb5cc_uid_random files.
>
> Are they asking for *new* credentials each time? They should only be
> doing one kinit.

Well, that's what I don't understand. In practice, I don't believe a user should ever have to explicitly run kinit, as their credentials/tickets are implicitly created (and forwarded) via ssh. Despite that, I see the /tmp/krb5cc_uid files accumulating over time. I've tried testing this, but I haven't been able to determine exactly what creates those files. And I don't understand why new krb5cc_uid files are created when there is an existing, valid file already. Clearly some programs ignore existing files, and some create new ones.

> And there's nothing in the logs, correct? Have you tried attaching
> strace to one of those to see if you can get a clue as to what's
> happening?

Actually, I do get this in the log:

Mar 22 13:25:09 daemon.err lnxdev108 rpc.gssd[19329]: WARNING: handle_gssd_upcall: failed to find uid in upcall string 'mech=krb5'

Thanks,
Matt
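P.S. One thing I'm planning to try, in case anyone sees a problem with it: a short-lived audit watch on /tmp to catch whichever processes create and delete those caches. The key name here is arbitrary, and a watch on all of /tmp is noisy on a busy box, so only for a brief window:

    # Log writes/attribute changes under /tmp, tagged for later searching
    auditctl -w /tmp/ -p wa -k krb5cc_watch
    # ...reproduce a job launch, then see which executables touched the caches
    ausearch -k krb5cc_watch -i | grep -B2 krb5cc
    # Remove the watch when done
    auditctl -W /tmp/ -p wa -k krb5cc_watch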
On Wed, Mar 22, 2017 at 6:11 PM, John Jasen <jjasen at realityfailure.org> wrote:
> On 03/22/2017 03:26 PM, Matt Garman wrote:
>> Is anyone on the list using kerberized-nfs on any kind of scale?
>
> Not for a good many years.
>
> Are you using v3 or v4 NFS?

v4. I think you can only do kerberized NFS with v4.

> Also, you can probably stuff the rpc.gss* and idmapd services into
> verbose mode, which may give you a better idea as to what's going on.

I do that. The logs are verbose, but generally too cryptic for me to make sense of. Web searches on the errors yield results at best 50% of the time, and the hits almost never include a solution.

> And yes, the kernel does some Kerberos caching. I think 10 to 15
> minutes.

To me it looks like it's more on the order of an hour. For example, a simple test I've done is a "fresh" login on a server: the server has just been rebooted, and with the reboot, all the /tmp/krb5cc* files were deleted. I log in via ssh, which implicitly establishes my Kerberos tickets. I deliberately run "kdestroy". Then I run a simple shell loop like this:

    while true ; do date ; ls ; sleep 30s ; done

That just does an ls on my home directory, which is a kerberized NFS mount. Despite the kdestroy, this works, presumably from cached credentials. And it continues to work for *about* an hour, and then I start getting permission denied. I emphasize "about" because it's not precisely one hour; it seems to range from maybe 55 to 65 minutes.

But that's a super-simple, controlled test. What happens when you add terminal multiplexers (tmux, GNU screen) into the mix? What if you log in "fresh" via password versus having your GSS (Kerberos) credentials forwarded? What if you're logged in multiple times on the same machine via different methods?
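P.S. To pin that window down more precisely, a slightly more instrumented variant of that loop is what I'm running now: same idea, it just stamps each probe with the elapsed time since the kdestroy and stops at the first failure:

    kdestroy
    start=$(date +%s)
    while true ; do
        # Probe the kerberized home directory; record OK or DENIED
        if ls >/dev/null 2>&1 ; then
            status=OK
        else
            status=DENIED
        fi
        printf '%s  +%ss  %s\n' "$(date '+%F %T')" "$(( $(date +%s) - start ))" "$status"
        [ "$status" = DENIED ] && break
        sleep 30s
    done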