thr3ads.net - freebsd stable - file system deadlock in RELENG

If this information is useful, please help other people find it:
Share via:

Mike Tancsa

2017-Aug-24 20:01 UTC

file system deadlock in RELENG_11

OK, this is fairly easy to repeat. If I start a sync of a snapshot via
zrep, it hangs the box.  CTRL+T shows


DEBUG: overiding stale lock on zroot/chyves from pid 19378
sending zroot/chyves at zrep_000010 to 10.151.9.2:zroot/chyves
cannot receive new filesystem stream: destination
'zroot/chyves/guests/resi/disk1' exists
must specify -F to overwrite it
^C
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 358.94r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 360.42r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 360.79r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 360.99r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.19r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.37r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.55r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.74r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.92r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.11r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.31r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.50r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.69r 0.00u 0.00s 0%
3476k
load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.90r 0.00u 0.00s 0%
3476k
load: 0.52  cmd: zfs 29690 [tx->tx_sync_done_cv] 363.08r 0.00u 0.00s 0%
3476k



On 8/24/2017 11:48 AM, Mike Tancsa wrote:> I upgraded a server yesterday from RELENG_11 from march 2017 to r322800
> (Aug 22) and noticed that under heavy disk IO in a VM, the server is
> locking up.  In the vm, I was doing a large untar and I noticed that
> prior to the lockup, the hypervisor would be struggling to keep up the
> disk writes.  The VM is on a zvol if that makes any difference. A few
> times in the VM, IO would be clogged to the point that the disk would
> timeout in the VM
> 
> Aug 24 08:32:02 kernel: ahcich6: Timeout on slot 14 port 0
> Aug 24 08:32:02 kernel: ahcich6: is 00000000 cs 00000000 ss ffffffff rs
> ffffffff tfd 50 serr 00000000 cmd 0001db17
> Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB:
> 61 00 a8 47 d8 40 01 00 00 01 00 00
> Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): CAM status: Command timeout
> Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): Retrying command
> 
> When the parent deadlocks, I cant run anything thats not already in RAM.
> shutdown doesnt work and I have to reboot the box via IPMI.
> 
> Any ideas how to debug this or try and better understand the problem so
> I can at least work around it ?
> 
> 	---Mike
> 

-- 
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike at sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/

Mike Tancsa

2017-Aug-24 20:10 UTC

head link

file system deadlock in RELENG_11

On 8/24/2017 4:01 PM, Mike Tancsa wrote:> OK, this is fairly easy to repeat. If I start a sync of a snapshot via
> zrep, it hangs the box.  CTRL+T shows
> 
> 
> DEBUG: overiding stale lock on zroot/chyves from pid 19378
> sending zroot/chyves at zrep_000010 to 10.151.9.2:zroot/chyves
> cannot receive new filesystem stream: destination
> 'zroot/chyves/guests/resi/disk1' exists
> must specify -F to overwrite it
> ^C
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 358.94r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 360.42r 0.00u 0.00s 0%
I was able to get ps to run and not sure if its helpful or not, but
these are the two unkillable zfs processes


root         5   0.0  0.0       0    128  -  DL   10:09      0:05.58
[zfskern]
root     29683   0.0  0.0    7752   3824  5  DE+  15:53      0:00.32 zfs
send -R -I zroot/chyves at zrep_00000d zroot/chyves at zrep_000010
root     29690   0.0  0.0    7752   3492  5  D+   15:53      0:00.00 zfs
rename zroot/chyves at zrep_000010 zroot/chyves at zrep_000010_unsent


-- 
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike at sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/

Ronald Klop

2017-Aug-26 10:08 UTC

head link

file system deadlock in RELENG_11

Running procstat -kk <pid> will display on what syscall a process/thread  
is blocking.
That might give some valuable information to people.

Regards,
Ronald.


On Thu, 24 Aug 2017 22:01:25 +0200, Mike Tancsa <mike at sentex.net>
wrote:
> OK, this is fairly easy to repeat. If I start a sync of a snapshot via
> zrep, it hangs the box.  CTRL+T shows
>
>
> DEBUG: overiding stale lock on zroot/chyves from pid 19378
> sending zroot/chyves at zrep_000010 to 10.151.9.2:zroot/chyves
> cannot receive new filesystem stream: destination
> 'zroot/chyves/guests/resi/disk1' exists
> must specify -F to overwrite it
> ^C
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 358.94r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 360.42r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 360.79r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 360.99r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.19r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.37r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.55r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.74r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 361.92r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.11r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.31r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.50r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.69r 0.00u 0.00s 0%
> 3476k
> load: 0.48  cmd: zfs 29690 [tx->tx_sync_done_cv] 362.90r 0.00u 0.00s 0%
> 3476k
> load: 0.52  cmd: zfs 29690 [tx->tx_sync_done_cv] 363.08r 0.00u 0.00s 0%
> 3476k
>
>
>
> On 8/24/2017 11:48 AM, Mike Tancsa wrote:
>> I upgraded a server yesterday from RELENG_11 from march 2017 to r322800
>> (Aug 22) and noticed that under heavy disk IO in a VM, the server is
>> locking up.  In the vm, I was doing a large untar and I noticed that
>> prior to the lockup, the hypervisor would be struggling to keep up the
>> disk writes.  The VM is on a zvol if that makes any difference. A few
>> times in the VM, IO would be clogged to the point that the disk would
>> timeout in the VM
>>
>> Aug 24 08:32:02 kernel: ahcich6: Timeout on slot 14 port 0
>> Aug 24 08:32:02 kernel: ahcich6: is 00000000 cs 00000000 ss ffffffff rs
>> ffffffff tfd 50 serr 00000000 cmd 0001db17
>> Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB:
>> 61 00 a8 47 d8 40 01 00 00 01 00 00
>> Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): CAM status: Command  
>> timeout
>> Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): Retrying command
>>
>> When the parent deadlocks, I cant run anything thats not already in
RAM.
>> shutdown doesnt work and I have to reboot the box via IPMI.
>>
>> Any ideas how to debug this or try and better understand the problem so
>> I can at least work around it ?
>>
>> 	---Mike
>>
>

freebsd stable - Aug 2017 - file system deadlock in RELENG_11

file system deadlock in RELENG_11

file system deadlock in RELENG_11

file system deadlock in RELENG_11