Harry Schmalzbauer
2017-Mar-07 12:42 UTC
unionfs bugs, a partial patch and some comments [Was: Re: 1-BETA3 Panic: __lockmgr_args: downgrade a recursed lockmgr nfs @ /usr/local/share/deploy-tools/RELENG_11/src/sys/fs/unionfs/union_vnops.c:1905]
Regarding Rick Macklem's message of 05.09.2016 23:21 (localtime):
> Harry Schmalzbauer <freebsd at omnilan.de> wrote:
>> Regarding Rick Macklem's message of 18.08.2016 02:03 (localtime):
>>> Kostik wrote:
>>> [stuff snipped]
>>>> insmntque() performs the cleanup on its own, and that default cleanup
>>>> is not suitable for the situation.  I think that insmntque1() would
>>>> better fit your requirements; you need to move the common code into a
>>>> helper.  It seems that the unionfs_ins_cached_vnode() cleanup could reuse it.
>>> I've attached an updated patch (untested like the last one).  This one creates a
>>> custom version of insmntque_stddtr() that first calls unionfs_noderem() and then
>>> does the same stuff as insmntque_stddtr().  This looks like it does the required
>>> stuff (unionfs_noderem() is what the unionfs VOP_RECLAIM() does).
>>> It switches the node back to using its own v_vnlock, which is exclusively locked,
>>> among other things.
>>
>> Thanks a lot, today I gave it a try.
>>
>> With this patch, one reproducible panic can still be easily triggered:
>> I have directory A unionfs-mounted under directory B.
>> Then I mount_unionfs the same directory A below another directory C.
>>
>> panic: __lockmgr_args: downgrade a recursed lockmgr nfs @
>> /usr/local/share/deploy-tools/RELENG_11/src/sys/fs/unionfs/union_vnops.c:1905
>>
>> The result is this backtrace, hardly helpful I guess:
>>
>> #1  0xffffffff80ae5fd9 in kern_reboot (howto=260) at
>>     /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/kern_shutdown.c:366
>> #2  0xffffffff80ae658b in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at
>>     /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/kern_shutdown.c:759
>> #3  0xffffffff80ae63c3 in panic (fmt=0x0) at
>>     /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/kern_shutdown.c:690
>> #4  0xffffffff80ab7ab7 in __lockmgr_args (lk=<value optimized out>, flags=<value optimized out>,
>>     ilk=<value optimized out>, wmesg=<value optimized out>, pri=<value optimized out>,
>>     timo=<value optimized out>, file=<value optimized out>, line=<value optimized out>) at
>>     /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/kern_lock.c:992
>> #5  0xffffffff80ba510c in vop_stdlock (ap=<value optimized out>) at lockmgr.h:98
>> #6  0xffffffff8111932d in VOP_LOCK1_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:2087
>> #7  0xffffffff80a18cfc in unionfs_lock (ap=0xfffffe007a3ba6a0) at vnode_if.h:859
>> #8  0xffffffff8111932d in VOP_LOCK1_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:2087
>> #9  0xffffffff80bc9b93 in _vn_lock (vp=<value optimized out>, flags=66560,
>>     file=<value optimized out>, line=<value optimized out>) at vnode_if.h:859
>> #10 0xffffffff80a18460 in unionfs_readdir (ap=<value optimized out>) at
>>     /usr/local/share/deploy-tools/RELENG_11/src/sys/fs/unionfs/union_vnops.c:1531
>> #11 0xffffffff81118ecf in VOP_READDIR_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:1822
>> #12 0xffffffff80bc6e3b in kern_getdirentries (td=<value optimized out>, fd=<value optimized out>,
>>     buf=0x800c3d000 <Address 0x800c3d000 out of bounds>, count=<value optimized out>,
>>     basep=0xfffffe007a3ba980, residp=0x0) at vnode_if.h:758
>> #13 0xffffffff80bc6bf8 in sys_getdirentries (td=0x0, uap=0xfffffe007a3baa40) at
>>     /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/vfs_syscalls.c:3940
>> #14 0xffffffff80fad6b8 in amd64_syscall (td=<value optimized out>, traced=0) at subr_syscall.c:135
>> #15 0xffffffff80f8feab in Xfast_syscall () at
>>     /usr/local/share/deploy-tools/RELENG_11/src/sys/amd64/amd64/exception.S:396
>> #16 0x0000000000452eea in ?? ()
>> Previous frame inner to this frame (corrupt stack?)
>
> Ok, I finally got around to looking at this, and the panic() looks like a pretty
> straightforward bug in the unionfs code.
> - In unionfs_readdir(), it does a vn_lock(..LK_UPGRADE) and then later in the code
>   vn_lock(..LK_DOWNGRADE) if it did the upgrade.  (At line #1531, as noted in the backtrace.)
> - In unionfs_lock(), it sets LK_CANRECURSE when it is the rootvp and LK_EXCLUSIVE.
>   (So it allows recursive acquisition in this case.)
> --> Then it would call vn_lock(..LK_DOWNGRADE), which would panic if it has recursed.
>
> Now, I'll admit unionfs_lock() is too obscure for me to understand, but...
> Is it necessary to vn_lock(..LK_DOWNGRADE), or can unionfs_readdir() just return
> with the vnode exclusively locked?
> (It would be easy to change the code to avoid the vn_lock(..LK_DOWNGRADE) call
> when it has done the vn_lock(..LK_EXCLUSIVE) after vn_lock(..LK_UPGRADE) fails.)
>
> rick
>
>> I ran your previous patch for some time.
>> Similarly, mounting one directory below a second mountpoint crashed the
>> machine (forgot to configure dumpdir, so I can't compare the backtrace
>> with the current patch).
>> Otherwise, at least with the previous patch, I haven't had any other
>> panic for about one week.

Something ufs-related seems to have tightened the unionfs locking problem in
stable/11.  Now, with Rick's latest patch applied, the machine instantaneously
panics during boot, right after mounting root.

Unfortunately I don't have swap available on that machine (yet), but maybe this
is a hint for somebody:

db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00982220e0
vpanic() at vpanic+0x186/frame 0xfffffe0098222160
kassert_panic() at kassert_panic+0x126/frame 0xfffffe00982221d0
witness_assert() at witness_assert+0x35a/frame 0xfffffe0098222230
__lockmgr_args() at __lockmgr_args+0x517/frame 0xfffffe00982222d0
vop_stdunlock() at vop_stdunlock+0x3b/frame 0xfffffe00982222f0
VOP_UNLOCK_APV() at VOP_UNLOCK_APV+0xe0/frame 0xfffffe0098222320
unionfs_unlock() at unionfs_unlock+0x112/frame 0xfffffe0098222390
VOP_UNLOCK_APV() at VOP_UNLOCK_APV+0xe0/frame 0xfffffe00982223c0
unionfs_nodeget() at unionfs_nodeget+0x3ef/frame 0xfffffe0098222470
unionfs_domount() at unionfs_domount+0x518/frame 0xfffffe00982226b0
vfs_donmount() at vfs_donmount+0xe37/frame 0xfffffe00982228f0
sys_nmount() at sys_nmount+0x72/frame 0xfffffe0098222930
amd64_syscall() at amd64_syscall+0x2f9/frame 0xfffffe0098222ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0098222ab0
--- syscall (378, FreeBSD ELF64, sys_nmount), rip = 0x80086ecea, rsp =
0x7fffffffe318, rbp = 0x7fffffffeca0 ---

-harry
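
A minimal sketch of the locking pattern Rick describes in unionfs_readdir() and
of the guard he suggests.  This is not the actual union_vnops.c code or his
patch; the function name and the simplified flow are illustrative only:

/*
 * Sketch only: how unionfs_readdir() ends up exclusively locked and where
 * the suggested guard would go (names and flow simplified).
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/lockmgr.h>
#include <sys/vnode.h>

static void
readdir_locking_sketch(struct vnode *vp)
{
	int how = 0;		/* how we ended up with the exclusive lock */

	if (VOP_ISLOCKED(vp) != LK_EXCLUSIVE) {
		if (vn_lock(vp, LK_UPGRADE) == 0)
			how = LK_UPGRADE;
		else {
			/* Upgrade failed and dropped the lock; re-take it. */
			vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
			how = LK_EXCLUSIVE;
		}
	}

	/* ... work that needs the exclusive lock ... */

	/*
	 * Today the downgrade happens whenever the lock was taken above.
	 * On the unionfs root vnode, unionfs_lock() may allow the
	 * LK_EXCLUSIVE fallback to recurse (LK_CANRECURSE), and downgrading
	 * a recursed lockmgr lock panics -- the panic in the subject line.
	 * Rick's suggestion: skip the downgrade when the exclusive lock came
	 * from the fallback path (or simply return exclusively locked).
	 */
	if (how == LK_UPGRADE)
		vn_lock(vp, LK_DOWNGRADE | LK_RETRY);
}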
Harry Schmalzbauer
2017-Mar-07 18:44 UTC
unionfs bugs, a partial patch and some comments [Was: Re: 1-BETA3 Panic: __lockmgr_args: downgrade a recursed lockmgr nfs @ /usr/local/share/deploy-tools/RELENG_11/src/sys/fs/unionfs/union_vnops.c:1905]
Regarding Harry Schmalzbauer's message of 07.03.2017 13:42 (localtime):
> Something ufs-related seems to have tightened the unionfs locking
> problem in stable/11.  Now, with Rick's latest patch applied, the machine
> instantaneously panics during boot, right after mounting root.
>
> Unfortunately I don't have swap available on that machine (yet), but
> maybe this is a hint for somebody:
>
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00982220e0
> vpanic() at vpanic+0x186/frame 0xfffffe0098222160
> kassert_panic() at kassert_panic+0x126/frame 0xfffffe00982221d0
> witness_assert() at witness_assert+0x35a/frame 0xfffffe0098222230
> __lockmgr_args() at __lockmgr_args+0x517/frame 0xfffffe00982222d0
> vop_stdunlock() at vop_stdunlock+0x3b/frame 0xfffffe00982222f0
> VOP_UNLOCK_APV() at VOP_UNLOCK_APV+0xe0/frame 0xfffffe0098222320
> unionfs_unlock() at unionfs_unlock+0x112/frame 0xfffffe0098222390
> VOP_UNLOCK_APV() at VOP_UNLOCK_APV+0xe0/frame 0xfffffe00982223c0
> unionfs_nodeget() at unionfs_nodeget+0x3ef/frame 0xfffffe0098222470
> unionfs_domount() at unionfs_domount+0x518/frame 0xfffffe00982226b0
> vfs_donmount() at vfs_donmount+0xe37/frame 0xfffffe00982228f0
> sys_nmount() at sys_nmount+0x72/frame 0xfffffe0098222930
> amd64_syscall() at amd64_syscall+0x2f9/frame 0xfffffe0098222ab0
> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0098222ab0
> --- syscall (378, FreeBSD ELF64, sys_nmount), rip = 0x80086ecea, rsp =
> 0x7fffffffe318, rbp = 0x7fffffffeca0 ---

New discovery: Rick's latest patch causes a panic only with KDB.  If I compile
a kernel without WITNESS and KDB, the machine boots fine!

Also, it's at least not so easy anymore to trigger the deadlock :-).  I need to
do more testing, but so far Rick's approach looks very promising :-).

Unfortunately I can't provide a fix, or even a suggestion as to why the KDB
kernel panics while the non-KDB kernel doesn't; my only vague guess is that the
additional locking checks (KASSERT?), which would catch further damage early,
are simply not in place in the non-debug kernel.

So I guess I'm in dangerous waters, but this definitely is a highly appreciated
improvement for me and my very best bet for now (neither eliminating unionfs
nor holding off 11 updates were real options for me, especially since unionfs
isn't really working well on 10.3 either, it just doesn't lead to deadlocks in
as many environments)!

I tried the non-debug kernel because I had browsed old unionfs discussions and
desperately gave Attilio Rao's patch a try, since I couldn't remember why I
hadn't kept it locally:
https://people.freebsd.org/~attilio/unionfs_nodeget4.patch
(he tried to solve the unionfs problems for RELENG_9 back in 2012:
https://lists.freebsd.org/pipermail/freebsd-stable/2012-November/070358.html)

It's still true that his patch leads to a panic with a debugging kernel only.
The same patch without KDB allows the machine to boot and start squid.  But the
result is the same as with plain r314856: the system deadlocks reproducibly.
Also, the trace with his patch looks identical to the plain r314856 unionfs
panic.

So I hope Rick or someone else can pick up the latest patch and polish it to
make KDB kernels happy :-)
I can offer a small donation if that helps!  Of course, I'll also provide KDB
info if needed/helpful.

thanks,

-harry
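
For context on the KDB-only panic: the checks that fire in the backtrace above
only exist in kernels built with the corresponding debug options.  Below is a
simplified paraphrase (not a verbatim copy of sys/sys/systm.h) of how KASSERT()
is wired up; witness_assert(), also seen in the trace, is likewise only
compiled in with "options WITNESS".  Removing the options does not fix the
lock misuse, it only removes the code that detects it.

/*
 * Simplified paraphrase of the KASSERT() definition (assumption: close to,
 * but not exactly, sys/sys/systm.h).  With "options INVARIANTS" a failed
 * assertion calls kassert_panic(), as in the backtrace above; without it
 * the check compiles to nothing, so the same questionable unlock proceeds
 * silently.
 */
#ifdef INVARIANTS
#define	KASSERT(exp, msg) do {						\
	if (__predict_false(!(exp)))					\
		kassert_panic msg;					\
} while (0)
#else
#define	KASSERT(exp, msg) do {						\
} while (0)
#endif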