On 21/07/2016 15:25, Karl Denninger wrote:
> The crash occurred while a backup script was running, which does (roughly)
> the following:
>
> zpool import -N backup   (mount the pool to copy to)
>
> iterate over a list of zfs filesystems and...
>
> zfs rename fs@zfs-base fs@zfs-old
> zfs snapshot fs@zfs-base
> zfs send -RI fs@zfs-old fs@zfs-base | zfs receive -Fudv backup
> zfs destroy -vr fs@zfs-old
>
> The first filesystem to be done is the rootfs; that is when it panic'd,
> and from the traceback it appears that the zios in there are from the
> backup volume, so the answer to your question is "yes".

I think that what happened here was that a quite large number of TRIM
requests was queued by ZFS before it had a chance to learn that the
target vdev in the backup pool did not support TRIM.  So, when the first
request failed with ENOTSUP, the vdev was marked as not supporting TRIM.
After that, all subsequent requests were failed without being sent down
the storage stack.  But the way that is done means that all of the
requests were processed by nested zio_execute() calls on the same stack,
and that led to the stack overflow.

Steve, do you think that this is a correct description of what happened?

The state of the pools that you described below probably contributed to
the avalanche of TRIMs that caused the problem.

> This is a different panic from the one I used to get on 10.2 (the other
> one was always in dounmount) and the former symptom was also not
> immediately reproducible; whatever was blowing it up before was in-core,
> and a reboot would clear it.  This one is not; I (foolishly) believed
> that the operation would succeed after the reboot and re-attempted it,
> only to get an immediate repeat of the same panic (with an essentially
> identical traceback).
>
> What allowed the operation to succeed was removing *all* of the
> snapshots (other than the base filesystem, of course) from both the
> source *and* the backup destination zpools, then re-running the
> operation.  That causes a "base" copy to be taken (zfs snapshot
> fs@zfs-base and then just a straight send of that instead of an
> incremental), which was successful.
>
> The only thing that was odd about the zfs filesystem in question was
> that, as the boot environment that was my roll-forward to 11.0, its
> "origin" was a clone of 10.2 taken before the install was done, so that
> snapshot was present in the zfs snapshot list.  However, it had been
> present for several days without incident, so I doubt its presence was
> involved in creating the circumstances that led to the panic.

--
Andriy Gapon
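Spelled out, the loop Karl describes might look roughly like the /bin/sh
sketch below.  The pool name "backup" and the snapshot names come from his
message; the FILESYSTEMS list, the "set -e" error handling and the final
export are illustrative assumptions, not taken from his actual script.

  #!/bin/sh
  # Rough sketch of the backup procedure described above.
  # Assumptions: the filesystem list, the error handling and the final
  # export are illustrative only.
  set -e

  BACKUP_POOL=backup
  FILESYSTEMS="zroot zroot/usr zroot/var"    # hypothetical list

  # Import the destination pool without mounting its datasets.
  zpool import -N "$BACKUP_POOL"

  for fs in $FILESYSTEMS; do
      # Move the previous baseline snapshot aside and take a new one.
      zfs rename "${fs}@zfs-base" "${fs}@zfs-old"
      zfs snapshot "${fs}@zfs-base"

      # Replicate everything between the old and new baselines, including
      # descendants (-R) and intermediate snapshots (-I).  With -F, the
      # receive also destroys snapshots on the backup pool that no longer
      # exist on the source, which is where the large burst of frees (and
      # hence TRIM requests) on the destination vdevs comes from.
      zfs send -RI "${fs}@zfs-old" "${fs}@zfs-base" | \
          zfs receive -Fudv "$BACKUP_POOL"

      # Drop the old baseline on the source.
      zfs destroy -vr "${fs}@zfs-old"
  done

  zpool export "$BACKUP_POOL"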
On 7/21/2016 07:52, Andriy Gapon wrote:
> On 21/07/2016 15:25, Karl Denninger wrote:
>> The crash occurred while a backup script was running, which does
>> (roughly) the following:
>>
>> zpool import -N backup   (mount the pool to copy to)
>>
>> iterate over a list of zfs filesystems and...
>>
>> zfs rename fs@zfs-base fs@zfs-old
>> zfs snapshot fs@zfs-base
>> zfs send -RI fs@zfs-old fs@zfs-base | zfs receive -Fudv backup
>> zfs destroy -vr fs@zfs-old
>>
>> The first filesystem to be done is the rootfs; that is when it panic'd,
>> and from the traceback it appears that the zios in there are from the
>> backup volume, so the answer to your question is "yes".
>
> I think that what happened here was that a quite large number of TRIM
> requests was queued by ZFS before it had a chance to learn that the
> target vdev in the backup pool did not support TRIM.  So, when the first
> request failed with ENOTSUP, the vdev was marked as not supporting TRIM.
> After that, all subsequent requests were failed without being sent down
> the storage stack.  But the way that is done means that all of the
> requests were processed by nested zio_execute() calls on the same stack,
> and that led to the stack overflow.
>
> Steve, do you think that this is a correct description of what happened?
>
> The state of the pools that you described below probably contributed to
> the avalanche of TRIMs that caused the problem.

The source for the backup is a pool composed entirely of SSDs (and thus
supports TRIM), and the target is a pair of spinning-rust devices (which
of course do not support TRIM); the incremental receive to that pool does
(of course) remove all of the obsolete snapshots.

What I don't understand, however, is why it had been running fine for a
week or so, why it immediately repeated the panic on a retry attempt --
and how to prevent it, at least at this point.  I certainly do not want
to leave the pool mounted when it is not in active backup use.

--
Karl Denninger
karl@denninger.net
The Market Ticker
[S/MIME encrypted email preferred]
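Until the underlying recursion is fixed, one way to confirm whether TRIM is
in play, and a possible stopgap, is to look at the ZFS TRIM tunable before
a backup run.  The sketch below assumes the vfs.zfs.trim.enabled tunable
used by the FreeBSD 10.x/11.x in-kernel TRIM support; it is a workaround
sketch, not a recommendation, and it has the obvious downside that turning
the tunable off disables TRIM for the SSD source pool as well.

  #!/bin/sh
  # Check whether ZFS TRIM is globally enabled before the backup run.
  # Assumption: vfs.zfs.trim.enabled is the boot-time tunable exposed by
  # the FreeBSD 10.x/11.x TRIM code; other versions may use a different
  # knob, so treat this purely as a sketch.

  if [ "$(sysctl -n vfs.zfs.trim.enabled 2>/dev/null)" = "1" ]; then
      echo "ZFS TRIM is enabled; the scenario described above (TRIM zios"
      echo "queued for vdevs that turn out not to support TRIM) is"
      echo "possible on the backup pool."
      echo "Possible stopgap (takes effect at the next boot):"
      echo "    echo 'vfs.zfs.trim.enabled=0' >> /boot/loader.conf"
  else
      echo "ZFS TRIM is disabled (or the tunable is not present)."
  fi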
On 21/07/2016 13:52, Andriy Gapon wrote:
> On 21/07/2016 15:25, Karl Denninger wrote:
>> The crash occurred while a backup script was running, which does
>> (roughly) the following:
>>
>> zpool import -N backup   (mount the pool to copy to)
>>
>> iterate over a list of zfs filesystems and...
>>
>> zfs rename fs@zfs-base fs@zfs-old
>> zfs snapshot fs@zfs-base
>> zfs send -RI fs@zfs-old fs@zfs-base | zfs receive -Fudv backup
>> zfs destroy -vr fs@zfs-old
>>
>> The first filesystem to be done is the rootfs; that is when it panic'd,
>> and from the traceback it appears that the zios in there are from the
>> backup volume, so the answer to your question is "yes".
>
> I think that what happened here was that a quite large number of TRIM
> requests was queued by ZFS before it had a chance to learn that the
> target vdev in the backup pool did not support TRIM.  So, when the first
> request failed with ENOTSUP, the vdev was marked as not supporting TRIM.
> After that, all subsequent requests were failed without being sent down
> the storage stack.  But the way that is done means that all of the
> requests were processed by nested zio_execute() calls on the same stack,
> and that led to the stack overflow.
>
> Steve, do you think that this is a correct description of what happened?
>
> The state of the pools that you described below probably contributed to
> the avalanche of TRIMs that caused the problem.

Yes, that does indeed sound like what happened to me.

Regards
Steve