Do you use L2ARC/ZIL disks? I had a similar problem that turned out to
be a broken caching SSD. Scrubbing didn't help a bit because it reported
that data was okay. And SMART was fine as well. Fortunately I could
still send/recv snapshots to a backup disk but wasn't able to replace
the SSDs without a pool restore. ZFS just wouldn't sync some older ZIL
data to disk and also wouldn't release the SSDs from the pool. Did you
also check the logs for entries that look like broken RAM?

Cheers,
Stefan

On 06/11/2018 01:29 PM, Willem Jan Withagen wrote:
> On 11-6-2018 12:53, Andriy Gapon wrote:
>> On 11/06/2018 13:26, Willem Jan Withagen wrote:
>>> On 11/06/2018 12:13, Andriy Gapon wrote:
>>>> On 08/06/2018 13:02, Willem Jan Withagen wrote:
>>>>> My file server is crashing about every 15 minutes at the moment.
>>>>> The panic looks like:
>>>>>
>>>>> Jun  8 11:48:43 zfs kernel: panic: Solaris(panic): zfs: allocating
>>>>> allocated segment(offset=12922221670400 size=24576)
>>>>> Jun  8 11:48:43 zfs kernel:
>>>>> Jun  8 11:48:43 zfs kernel: cpuid = 1
>>>>> Jun  8 11:48:43 zfs kernel: KDB: stack backtrace:
>>>>> Jun  8 11:48:43 zfs kernel: #0 0xffffffff80aada57 at kdb_backtrace+0x67
>>>>> Jun  8 11:48:43 zfs kernel: #1 0xffffffff80a6bb36 at vpanic+0x186
>>>>> Jun  8 11:48:43 zfs kernel: #2 0xffffffff80a6b9a3 at panic+0x43
>>>>> Jun  8 11:48:43 zfs kernel: #3 0xffffffff82488192 at vcmn_err+0xc2
>>>>> Jun  8 11:48:43 zfs kernel: #4 0xffffffff821f73ba at zfs_panic_recover+0x5a
>>>>> Jun  8 11:48:43 zfs kernel: #5 0xffffffff821dff8f at range_tree_add+0x20f
>>>>> Jun  8 11:48:43 zfs kernel: #6 0xffffffff821deb06 at metaslab_free_dva+0x276
>>>>> Jun  8 11:48:43 zfs kernel: #7 0xffffffff821debc1 at metaslab_free+0x91
>>>>> Jun  8 11:48:43 zfs kernel: #8 0xffffffff8222296a at zio_dva_free+0x1a
>>>>> Jun  8 11:48:43 zfs kernel: #9 0xffffffff8221f6cc at zio_execute+0xac
>>>>> Jun  8 11:48:43 zfs kernel: #10 0xffffffff80abe827 at taskqueue_run_locked+0x127
>>>>> Jun  8 11:48:43 zfs kernel: #11 0xffffffff80abf9c8 at taskqueue_thread_loop+0xc8
>>>>> Jun  8 11:48:43 zfs kernel: #12 0xffffffff80a2f7d5 at fork_exit+0x85
>>>>> Jun  8 11:48:43 zfs kernel: #13 0xffffffff80ec4abe at fork_trampoline+0xe
>>>>> Jun  8 11:48:43 zfs kernel: Uptime: 9m7s
>>>>>
>>>>> Maybe a known bug?
>>>>> Is there anything I can do about this?
>>>>> Any debugging needed?
>>>>
>>>> Sorry to inform you but your on-disk data got corrupted.
>>>> The most straightforward thing you can do is try to save data from the pool in
>>>> readonly mode.
>>>
>>> Hi Andriy,
>>>
>>> Ouch, that is a first in 12 years of using ZFS. "Fortunately" it was a test
>>> ZVOL->iSCSI->Win10 disk on which I spool my CAMs.
>>>
>>> Removing the ZVOL actually fixed the rebooting, but now the question is:
>>>     Is the remainder of the zpools on the same disks in danger?
>>
>> You can try to check with zdb -b on an idle (better exported) pool. And zpool
>> scrub.
>
> If scrub says things are okay, I can start breathing again?
> Exporting the pool is something for the small hours.
>
> Thanx,
> --WjW
>
>
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"

--
Stefan Wendler
stefan.wendler at tngtech.com
+49 (0) 176 - 2438 3835
Senior Consultant

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082
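
A minimal sketch of the checks Andriy suggests, assuming a pool named "tank"
(the pool name is a placeholder): zdb -b traverses every block pointer and
reports leaked or doubly allocated space, which is the kind of space-map
inconsistency behind an "allocating allocated segment" panic, and a scrub
then verifies checksums on all allocated blocks.

    # "tank" is a placeholder; substitute the real pool name.
    zpool export tank            # check with the pool idle/exported
    zdb -e -b tank               # -e reads an exported pool, -b walks all
                                 # block pointers and reports leaked or
                                 # double-allocated space
    zpool import tank
    zpool scrub tank             # verify checksums of every allocated block
    zpool status -v tank         # watch scrub progress and reported errors

    # If a pool keeps panicking, the read-only salvage route mentioned
    # earlier in the thread would be:
    zpool import -o readonly=on tank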
On 11-6-2018 14:35, Stefan Wendler wrote:
> Do you use L2ARC/ZIL disks? I had a similar problem that turned out to
> be a broken caching SSD. Scrubbing didn't help a bit because it reported
> that data was okay. And SMART was fine as well. Fortunately I could
> still send/recv snapshots to a backup disk but wasn't able to replace
> the SSDs without a pool restore. ZFS just wouldn't sync some older ZIL
> data to disk and also wouldn't release the SSDs from the pool. Did you
> also check the logs for entries that look like broken RAM?

That was one of the things I looked for: bad things in the log files. But the
server does not seem to have any hardware problems.
I'll dive a bit deeper into my ZIL SSDs.

Thanx,
--WjW

> Cheers,
> Stefan
>
> On 06/11/2018 01:29 PM, Willem Jan Withagen wrote:
>> On 11-6-2018 12:53, Andriy Gapon wrote:
>>> On 11/06/2018 13:26, Willem Jan Withagen wrote:
>>>> On 11/06/2018 12:13, Andriy Gapon wrote:
>>>>> On 08/06/2018 13:02, Willem Jan Withagen wrote:
>>>>>> My file server is crashing about every 15 minutes at the moment.
>>>>>> The panic looks like:
>>>>>>
>>>>>> Jun  8 11:48:43 zfs kernel: panic: Solaris(panic): zfs: allocating
>>>>>> allocated segment(offset=12922221670400 size=24576)
>>>>>> Jun  8 11:48:43 zfs kernel:
>>>>>> Jun  8 11:48:43 zfs kernel: cpuid = 1
>>>>>> Jun  8 11:48:43 zfs kernel: KDB: stack backtrace:
>>>>>> Jun  8 11:48:43 zfs kernel: #0 0xffffffff80aada57 at kdb_backtrace+0x67
>>>>>> Jun  8 11:48:43 zfs kernel: #1 0xffffffff80a6bb36 at vpanic+0x186
>>>>>> Jun  8 11:48:43 zfs kernel: #2 0xffffffff80a6b9a3 at panic+0x43
>>>>>> Jun  8 11:48:43 zfs kernel: #3 0xffffffff82488192 at vcmn_err+0xc2
>>>>>> Jun  8 11:48:43 zfs kernel: #4 0xffffffff821f73ba at zfs_panic_recover+0x5a
>>>>>> Jun  8 11:48:43 zfs kernel: #5 0xffffffff821dff8f at range_tree_add+0x20f
>>>>>> Jun  8 11:48:43 zfs kernel: #6 0xffffffff821deb06 at metaslab_free_dva+0x276
>>>>>> Jun  8 11:48:43 zfs kernel: #7 0xffffffff821debc1 at metaslab_free+0x91
>>>>>> Jun  8 11:48:43 zfs kernel: #8 0xffffffff8222296a at zio_dva_free+0x1a
>>>>>> Jun  8 11:48:43 zfs kernel: #9 0xffffffff8221f6cc at zio_execute+0xac
>>>>>> Jun  8 11:48:43 zfs kernel: #10 0xffffffff80abe827 at taskqueue_run_locked+0x127
>>>>>> Jun  8 11:48:43 zfs kernel: #11 0xffffffff80abf9c8 at taskqueue_thread_loop+0xc8
>>>>>> Jun  8 11:48:43 zfs kernel: #12 0xffffffff80a2f7d5 at fork_exit+0x85
>>>>>> Jun  8 11:48:43 zfs kernel: #13 0xffffffff80ec4abe at fork_trampoline+0xe
>>>>>> Jun  8 11:48:43 zfs kernel: Uptime: 9m7s
>>>>>>
>>>>>> Maybe a known bug?
>>>>>> Is there anything I can do about this?
>>>>>> Any debugging needed?
>>>>>
>>>>> Sorry to inform you but your on-disk data got corrupted.
>>>>> The most straightforward thing you can do is try to save data from the pool in
>>>>> readonly mode.
>>>>
>>>> Hi Andriy,
>>>>
>>>> Ouch, that is a first in 12 years of using ZFS. "Fortunately" it was a test
>>>> ZVOL->iSCSI->Win10 disk on which I spool my CAMs.
>>>>
>>>> Removing the ZVOL actually fixed the rebooting, but now the question is:
>>>>     Is the remainder of the zpools on the same disks in danger?
>>>
>>> You can try to check with zdb -b on an idle (better exported) pool. And zpool
>>> scrub.
>>
>> If scrub says things are okay, I can start breathing again?
>> Exporting the pool is something for the small hours.
>>
>> Thanx,
>> --WjW
>>
>>
>> _______________________________________________
>> freebsd-stable at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
>> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
>>
>
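
A rough sketch of where that closer look at the ZIL SSDs could start,
assuming a pool named "tank" with a dedicated log device (pool and device
names are placeholders): zpool status lists the log and cache vdevs and any
errors against them, smartctl from smartmontools reads the SSD's own error
log, and removing a log device forces ZFS to commit its contents back to the
main pool, so a hang or error there would point at the SSD.

    # "tank" and the device names are placeholders for the actual setup.
    zpool status -v tank         # the "logs" and "cache" sections show the
                                 # ZIL/L2ARC SSDs and per-device error counts
    smartctl -a /dev/ada2        # smartmontools: SSD error log and wear data

    # A dedicated log device can normally be removed once its contents have
    # been flushed to the main pool; trouble doing so points at the SSD.
    zpool remove tank ada2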