On 12/22/19 12:01 PM, Rick Macklem wrote:
> Well, I've noted the flawed protocol. Here's an example (from my
> limited understanding of these protocols, for which there has never
> been a published spec):
> - The NLM supports a "blocking lock request" that goes something like
>   this...
>   - the client requests a lock and is willing to wait for it
>   - if the server has a conflicting lock on the file, it replies
>     "I'll acquire the lock for you when I can and let you know".
>   --> When the conflicting lock is released, the server acquires the
>       lock and does a callback (a server->client RPC) to tell the
>       client it now has the lock.
> You don't have to think about this for long to realize that any
> network unreliability or partitioning could result in trouble.
> The kernel RPC layer may do some retries of the RPCs (this is
> controlled by the parameters set for the RPC), but at some point the
> protocol asks the NSM (rpc.statd) if the machine is "up" and then uses
> the NSM's answer to deal with it. (The NSM basically pokes other
> systems and notes that they are "up" if it gets replies to these
> pokes. It uses IP broadcast at some point.)
>
> Now, maybe switching to TCP will make the RPCs reliable enough that it
> will work, or maybe it won't? (It certainly sounds like the Netapp
> upgrade is causing some kind of network issue, and the NLM doesn't
> tolerate that well.)
>
> rick
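
To make the failure mode concrete, here's a toy sketch in C of the
blocking-lock exchange described above. The structs and function names
are all made up for illustration; this is not the real NLM wire protocol
or the kernel code. It just shows why a lost "granted" callback leaves
the client waiting with no way to know:

    #include <stdio.h>
    #include <stdbool.h>

    enum nlm_reply { NLM_GRANTED, NLM_BLOCKED };

    struct server {
        bool lock_held;      /* someone holds a conflicting lock */
        bool client_waiting; /* server owes the client a callback */
    };

    /* client -> server: "lock this, I'm willing to wait" */
    static enum nlm_reply
    nlm_lock_request(struct server *srv)
    {
        if (srv->lock_held) {
            srv->client_waiting = true;  /* "I'll let you know" */
            return NLM_BLOCKED;
        }
        srv->lock_held = true;
        return NLM_GRANTED;
    }

    /* conflicting lock released: server grabs the lock for the
     * waiter and fires the server -> client "granted" callback RPC */
    static void
    nlm_unlock(struct server *srv, bool callback_delivered,
               bool *client_thinks_it_has_lock)
    {
        srv->lock_held = false;
        if (srv->client_waiting) {
            srv->lock_held = true;  /* taken on the waiter's behalf */
            srv->client_waiting = false;
            if (callback_delivered)
                *client_thinks_it_has_lock = true;
            /* else: the callback RPC was lost; the server thinks the
             * client has the lock, the client is still waiting, and
             * only RPC retries or the NSM can break the deadlock */
        }
    }

    int
    main(void)
    {
        struct server srv = { true, false };  /* lock already held */
        bool client_has_lock = false;

        if (nlm_lock_request(&srv) == NLM_BLOCKED)
            printf("client: blocked, waiting for the callback\n");

        /* simulate a network partition eating the callback */
        nlm_unlock(&srv, false, &client_has_lock);

        printf("server: lock held for client? yes\n");
        printf("client: thinks it has the lock? %s\n",
               client_has_lock ? "yes" : "no");
        return 0;
    }
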
tl;dr I think netapp effectively nerfed UDP lockd performance in newer
versions, maybe cluster mode.
From my very un-fun experience after migrating our volumes off an older
netapp onto a new netapp with flash drives (plenty fast) running Ontap
9.x ("cluster mode"), our typical IO load from idle-time IMAP
connections was enough to overwhelm the new netapp and drive performance
into the ground. The same IO had been perfectly fine on the old netapp.
Going into a workday in this state was absolutely not possible. I opened
a high-priority ticket with netapp, didn't really get anywhere that very
long day, and settled on nolockd so I could go home and sleep. Both my
later hunch and netapp support suggested switching lockd traffic to TCP,
even though I had no network problems (the old netapp was fine). I think
I still run into occasional load issues, but the newer netapp OS seemed
far more capable of handling this load when using TCP lockd. Of course
they also suggested switching to nfsv4, but I could not seriously
entertain validating that kind of change for production in less than a
day.
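
As an aside on the TCP switch: at the RPC level it's just a different
transport choice when the client handle is created. Here's a rough
userland sketch using the standard Sun RPC / TI-RPC API (not what the
kernel lockd actually does internally, and you may need -ltirpc to
build it). The NLM program number and version stay the same; only the
nettype string changes, and with "tcp" retransmission is handled by the
transport instead of RPC-layer timeouts:

    #include <rpc/rpc.h>
    #include <stdio.h>

    #define NLM_PROG  100021   /* well-known NLM program number */
    #define NLM_VERS4 4        /* NLM v4, used with NFSv3 */

    int
    main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s server\n", argv[0]);
            return 1;
        }
        /* swapping "udp" for "tcp" here is the whole change */
        CLIENT *clnt = clnt_create(argv[1], NLM_PROG, NLM_VERS4,
                                   "tcp");
        if (clnt == NULL) {
            clnt_pcreateerror("clnt_create(NLM over tcp)");
            return 1;
        }
        printf("got an NLM v4 handle over TCP to %s\n", argv[1]);
        clnt_destroy(clnt);
        return 0;
    }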