thr3ads.net - freebsd stable - nfs lockd errors after NetApp software upgrade. [Dec 2019]

Rick Macklem

2019-Dec-21 17:32 UTC

nfs lockd errors after NetApp software upgrade.

Daniel Braniss wrote:
>>On 20 Dec 2019, at 19:19, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>
>>Adam McDougall wrote:
>>>Try changing bool_t do_tcp = FALSE; to TRUE in
>>>/usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>>>think this makes it match Linux client behavior. I suspect I ran into
>>>the same issue as you. I do think I used nolockd as a workaround
>>>temporarily. I can provide some more details if it works.
>>If this fixes the problem, please let me know.
>>
>>I'm not sure I'd want to change the default, since it might break things for
>>others, but I can definitely make it a tunable, so that people don't need to
>>recompile a kernel to deal with it.
>>
>>
>great! I was just about to see how it can be done (tunable) but need to check
>if it can be done at any time, or just at boot time.
I haven't looked at the code, but I suspect changing it on the fly could cause
problems, so I am inclined to make it a tunable (boot time only).
>thanks.
>btw, currently, from several hours of analysing the traffic, it seems that
>nlm is UDP.
I assume that means you haven't tried flipping it to TCP yet.

Please let us know how it goes, rick

danny


rick

On 12/19/19 9:21 AM, Daniel Braniss wrote:


On 19 Dec 2019, at 16:09, Rick Macklem <rmacklem at uoguelph.ca> wrote:

Daniel Braniss wrote:
[stuff snipped]
all mounts are nfsv3/tcp
This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't know
when the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.
can the replay cache have any influence here? I tend to remember way back issues
with it,

To me, it looks like a network configuration issue.
that was/is my gut feeling too, but, as far as we can tell, nothing has changed
in the network infrastructure,
the problems appeared after the NetApp's software was updated, it was working
fine till then.

the problems are also happening on freebsd 12.1

You could capture packets (maybe when a client first starts rpc.statd and
rpc.lockd) and then look at them in wireshark. I'd disable startup of rpc.lockd
and rpc.statd at boot for a test client and then run something like:
# tcpdump -s 0 -w out.pcap host <netapp-host>
- and then start rpc.statd and rpc.lockd
Then I'd look at out.pcap in wireshark (much better at decoding this stuff than
tcpdump). I'd look for things like different reply IP addresses from the Netapp,
which might confuse this tired old NLM protocol Sun devised in the mid-1980s.
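
Alongside the packet capture, it can help to see which transports the filer
registers for the lock manager, since the thread is unsure whether NLM runs
over UDP or TCP. The sketch below is only an illustration: it assumes you have
saved the output of "rpcinfo -p <netapp-host>" (a tool not mentioned in the
thread) and parses it for the standard ONC RPC program numbers 100021
(nlockmgr) and 100024 (status). The sample text and its port numbers are made
up for the example, and the function name is hypothetical.

```python
# ONC RPC program numbers for the lock manager and the status monitor.
NLM_PROGRAM = 100021  # nlockmgr (rpc.lockd)
NSM_PROGRAM = 100024  # status (rpc.statd)

# Hypothetical "rpcinfo -p" output; ports and versions are invented.
SAMPLE_RPCINFO = """\
   program vers proto   port  service
    100000    2   udp    111  rpcbind
    100024    1   udp   4046  status
    100024    1   tcp   4046  status
    100021    4   udp   4045  nlockmgr
    100021    4   tcp   4045  nlockmgr
"""


def transports(rpcinfo_text, program):
    """Return sorted (version, proto, port) tuples registered for `program`."""
    found = set()
    for line in rpcinfo_text.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a registration row.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        if int(fields[0]) == program:
            found.add((int(fields[1]), fields[2], int(fields[3])))
    return sorted(found)


if __name__ == "__main__":
    for vers, proto, port in transports(SAMPLE_RPCINFO, NLM_PROGRAM):
        print(f"nlockmgr v{vers} over {proto} on port {port}")
```

If the real output lists nlockmgr on both udp and tcp, the server side at least
offers both, and the capture should then show which one the client picks.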

it's going to be an interesting week end :-(

the error is also appearing on freebsd-11.2-stable, I'm now checking if it's also
happening on 12.1
btw, the NetApp version is 9.3P17
Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
try to implement it, because I knew the protocol was badly broken) and I avoid
fiddling with it. As such, it won't have changed much since around FreeBSD 7.
and we haven't had any issues with it for years, so you must have done something
good

cheers,
     danny


rick

cheers,
      danny

rick

Cheers

Richard
(NetApp admin)

On Wed, 18 Dec 2019 at 15:46, Daniel Braniss <danny at cs.huji.ac.il> wrote:

On 18 Dec 2019, at 16:55, Rick Macklem <rmacklem at uoguelph.ca> wrote:

Daniel Braniss wrote:

Hi,
The server with the problems is running FreeBSD 11.1 stable, it was working fine
for several months,
but after a software upgrade of our NetApp server it's reporting many lockd
errors and becomes catatonic,
...
Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
Dec 18 13:11:45 moo-09 last message repeated 7 times
Dec 18 13:12:55 moo-09 last message repeated 8 times
Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
Dec 18 13:13:10 moo-09 last message repeated 8 times
Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance ...
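
To get a feel for how often each symptom shows up, a log excerpt like the one
above can be tallied with a few lines of Python. This is a rough sketch, not
something from the thread: the function and pattern names are made up, it
assumes each "last message repeated N times" line refers to the matched
message just before it, and it assumes the "(N occurrences)" suffix can be
summed as an event count.

```python
import re
from collections import Counter

# Lines copied from the excerpt above, each rejoined onto a single line.
SAMPLE_LOG = """\
Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
Dec 18 13:11:45 moo-09 last message repeated 7 times
Dec 18 13:12:55 moo-09 last message repeated 8 times
Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
Dec 18 13:13:10 moo-09 last message repeated 8 times
Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
"""

PATTERNS = {
    "lockd_not_responding": re.compile(r"lockd not responding"),
    "lockd_alive_again": re.compile(r"lockd is alive again"),
    "listen_queue_overflow": re.compile(r"Listen queue overflow"),
}
REPEATED = re.compile(r"last message repeated (\d+) times")
OCCURRENCES = re.compile(r"\((\d+) occurrences\)")


def tally(log_text):
    """Count events per category, honouring syslog repeat summaries."""
    counts = Counter()
    last_kind = None  # category of the most recent matched message
    for line in log_text.splitlines():
        repeat = REPEATED.search(line)
        if repeat:
            # Credit the repeats to the previously matched message, if any.
            if last_kind is not None:
                counts[last_kind] += int(repeat.group(1))
            continue
        last_kind = None
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                last_kind = kind
                occ = OCCURRENCES.search(line)
                counts[kind] += int(occ.group(1)) if occ else 1
                break
    return counts


if __name__ == "__main__":
    for kind, count in tally(SAMPLE_LOG).items():
        print(f"{kind}: {count}")
```

Running it against the full /var/log/messages from several clients would show
whether the lockd timeouts and the listen queue overflows cluster in time.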
Seems like their software upgrade didn't improve handling of NLM RPCs?
Appears to be handling RPCs slowly and/or intermittently. Note that no one
tests it with IPv6, so at least make sure you are still using IPv4 for the
mounts and try and make sure IP broadcast works between client and Netapp.
I think the NLM and NSM (rpc.statd) still use IP broadcast sometimes.

we are ipv4 - we have our own class c :-)
Maybe the network guys can suggest more w.r.t. why, but as I've stated before,
the NLM is a fundamentally broken protocol which was never published by Sun,
so I suggest you avoid using it if at all possible.
well, at the moment the ball is in NetApp's court, and switching to NFSv4 at the
moment is out of the question, it's
a production server used by several thousand students.


- If the locks don't need to be seen by other clients, you can just use the
"nolockd" mount option.
or
- If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
should support NFSv4.1, which is a much better protocol than NFSv4.0.

Good luck with it, rick
thanks
     danny

...
any ideas?

thanks,
    danny

_______________________________________________
freebsd-stable at freebsd.org<mailto:freebsd-stable at
freebsd.org><mailto:freebsd-stable at freebsd.org> mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe at
freebsd.org<mailto:freebsd-stable-unsubscribe at freebsd.org>"

Daniel Braniss

2019-Dec-22 06:18 UTC

nfs lockd errors after NetApp software upgrade.

> On 21 Dec 2019, at 19:32, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> Daniel Braniss wrote:
>>> On 20 Dec 2019, at 19:19, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>> 
>>> Adam McDougall wrote:
>>>> Try changing bool_t do_tcp = FALSE; to TRUE in
>>>> /usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>>>> think this makes it match Linux client behavior. I suspect I ran into
>>>> the same issue as you. I do think I used nolockd as a workaround
>>>> temporarily. I can provide some more details if it works.
>>> If this fixes the problem, please let me know.
>>> 
>>> I'm not sure I'd want to change the default, since it might break things
>>> for others, but I can definitely make it a tunable, so that people don't
>>> need to recompile a kernel to deal with it.
>>> 
>>> 
>> great! I was just about to see how it can be done (tunable) but need to
>> check if it can be done at any time, or just at boot time.
> I haven't looked at the code, but I suspect changing it on the fly could
> cause problems, so I am inclined to make it a tunable (boot time only).
my feelings too.
>
>> thanks.
>> btw, currently, from several hours of analysing the traffic, it seems that
>> nlm is UDP.
> I assume that means you haven't tried flipping it to TCP yet.
I will soon, but I have my doubts, the problem is caused by multiple events,
i.e., it happened once while I was doing svn checkout, but I have done it
several times since, and no issues. So it must be an aggregation of factors.
Other hosts are reporting locks times too.

danny
