thr3ads.net - freebsd stable - ZFS causing panic [Jul 2012]

Clayton Milos

2012-Jul-23 13:10 UTC

ZFS causing panic

Hi guys

I've had an issue for some time now. When I'm copying a lot of files over to
ZFS, usually using SMB, it causes a panic and locks up the server.
I'm running FreeBSD 9.0-RELEASE with a custom kernel. I've just pulled
unnecessary drivers out of the config and added:
cpu             HAMMER
device          pf
device          pflog
options         DEVICE_POLLING
options         HZ=1000

For full disclosure, I am getting these errors in the syslog, which means
there's an ECC error occurring somewhere, which I am trying to locate. I have
replaced both of the CPUs and all of the RAM and am still getting it, so
perhaps the north bridge has bought the farm. I don't think that this is the
issue though, because I was getting panics before on other hardware. Current
hardware is an 80G OS drive, 2x Opteron 285s and 16G (8x2G) of RAM on a
Tyan 2892 motherboard. The RAID card is an Areca 1120.
I am running 2 pools. Both of them are 4-drive hardware RAID5. The one I'm
having issues with is 4x3TB drives seen as a 9TB SCSI drive:
da0 at arcmsr0 bus 0 scbus6 target 0 lun 0
da0: <Areca HOMER R001> Fixed Direct Access SCSI-5 device 
da0: 166.666MB/s transfers (83.333MHz, offset 32, 16bit)
da0: Command Queueing enabled
da0: 8583068MB (17578123776 512 byte sectors: 255H 63S/T 1094187C)
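
As a quick sanity check on the sizes in that probe output, the arithmetic can be sketched in Python. The sector count and sector size are copied from the da0 line; the usable-capacity figure assumes ordinary RAID5 parity overhead (one drive's worth out of four), which the post implies but does not spell out.

```python
# Figures copied from the da0 probe line in the post.
SECTORS = 17578123776
SECTOR_BYTES = 512

size_bytes = SECTORS * SECTOR_BYTES
size_mib = size_bytes // (1024 * 1024)  # the "MB" in the probe line is 2**20 bytes
size_tb = size_bytes / 1e12             # decimal terabytes, as drive vendors count

# Assumption: RAID5 across 4 drives stores (4 - 1) drives' worth of data.
DRIVES = 4
DRIVE_TB = 3
raid5_usable_tb = (DRIVES - 1) * DRIVE_TB

print("size in MiB:", size_mib)
print("size in TB: ", round(size_tb, 3))
print("RAID5 usable TB (assumed layout):", raid5_usable_tb)
```

The computed MiB figure matches the 8583068MB reported by the driver, and the decimal size is consistent with the volume being described as 9TB.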

This is encrypted with GELI to make /dev/da0.eli upon which the pool is
created. It looks like it's lost the pool now since the last panic:
  pool: homer
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from
        a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
 scan: scrub repaired 0 in 7h0m with 0 errors on Mon Jul 23 05:25:27 2012
config:

        NAME        STATE     READ WRITE CKSUM
        homer       FAULTED      0     0     2
          da0.eli   ONLINE       0     0     8

Also I was running a script to check the kernel memory every 2 seconds. It
appears that it was well within the 1G I have assigned it in
/boot/loader.conf:
TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695219900, 663.013 MB
TOTAL=695219900, 663.013 MB
TOTAL=695345852, 663.133 MB
TOTAL=695412412, 663.197 MB
TOTAL=695228092, 663.021 MB
TOTAL=695228092, 663.021 MB
TOTAL=695226044, 663.019 MB
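
To put those samples next to the configured limit, here is a minimal Python sketch. It is not the monitoring script used in the post (that script is not shown); it only parses lines in the `TOTAL=<bytes>, <MB> MB` shape above and compares the peak against the 1024M kmem size from /boot/loader.conf. The function names are hypothetical.

```python
import re

# vm.kmem_size="1024M" from the loader.conf quoted in the post.
KMEM_LIMIT_BYTES = 1024 * 1024 * 1024

def parse_total(line):
    """Return the byte count from a line like 'TOTAL=695217852, 663.011 MB'."""
    match = re.match(r"TOTAL=(\d+),", line.strip())
    return int(match.group(1)) if match else None

def headroom_percent(used_bytes, limit_bytes=KMEM_LIMIT_BYTES):
    """Percentage of the limit that is still unused."""
    return 100.0 * (limit_bytes - used_bytes) / limit_bytes

samples = [
    "TOTAL=695217852, 663.011 MB",
    "TOTAL=695345852, 663.133 MB",
    "TOTAL=695412412, 663.197 MB",
    "TOTAL=695226044, 663.019 MB",
]

peak = max(parse_total(s) for s in samples)
print("peak bytes:", peak)
print("peak MiB:  ", round(peak / 2**20, 3))
print("headroom %:", round(headroom_percent(peak), 1))
```

On these samples the peak stays roughly a third below the limit, which supports the observation that kernel memory was not exhausted when the panic hit.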

My /boot/loader.conf contains:
ng_bpf_load="YES"
amdtemp_load="YES"
ubsec_load="YES"
vm.kmem_size="1024M"
vm.kmem_size_max="1024M"
vfs.zfs.arc_max="600M"
vfs.zfs.vdev.cache.size="8M"
vfs.zfs.txg.timeout="5"
kern.maxvnodes="250000"
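
One simple consistency check on these tunables is that the ARC cap fits inside the kernel memory size, since the ARC is allocated from kernel memory. The Python sketch below parses the size values as written above and makes that comparison. The parser is a hypothetical helper, not a FreeBSD tool, and it only understands plain integers with an optional K, M or G suffix. It says nothing about whether the values are well chosen for this workload.

```python
import re

UNITS = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30}

def parse_size(value):
    """Convert a string such as '600M' to bytes (K/M/G suffixes only)."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]

def parse_loader_conf(text):
    """Return a dict of name -> unquoted value from loader.conf-style text."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings

LOADER_CONF = '''
vm.kmem_size="1024M"
vm.kmem_size_max="1024M"
vfs.zfs.arc_max="600M"
vfs.zfs.vdev.cache.size="8M"
'''

settings = parse_loader_conf(LOADER_CONF)
kmem = parse_size(settings["vm.kmem_size"])
arc_max = parse_size(settings["vfs.zfs.arc_max"])

print("arc_max fits inside kmem_size:", arc_max < kmem)
print("arc_max as % of kmem_size:", round(100.0 * arc_max / kmem, 1))
```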

This system is a home server, so I can run a debug kernel if required and
crash it again.
My first question is: am I doing something wrong? I think the values I've put
in are sufficient, but I could well have got them wrong.
The server also doesn't appear to be writing the crash dump out. It hung at
1% and I had to power cycle it.
This is the panic:
panic: solaris assert: 0 == zap_increment_int(os, (-1ULL), user, delta, tx) (0x0 == 0x7a), file:
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_object.c, line: 1224
cpuid = 3
KDB: stack backtrace
#0 0xffffffff8055b74e at kdb_backtrace+0x5e
#1 0xffffffff80525c47 at panic+0x187
#2 0xffffffff80e71b9d at do_userquota_update+0xad
#3 0xffffffff80e71dae at dmu_objset_do_userquota_updates+0x1de
#4 0xffffffff80e882af at dsl_pool_sync+0x11f
#5 0xffffffff80e976e4 at spa_sync+0x334
#6 0xffffffff80ea7ed3 at txg_sync_thread+0x253
#7 0xffffffff804f89ee at fork_exit+0x11e
#8 0xffffffff8075847e at fork_trampoline+0xe
Uptime: 14h31m10s
Dumping 2489 out of 16370 MB:..1%
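
When comparing a hand-copied backtrace like this one against another crash, it can help to split each frame into its parts. The following Python sketch does that for lines in the `#N 0xADDR at symbol+0xOFF` shape shown above. The regular expression and function name are assumptions made for illustration, not part of any FreeBSD tooling.

```python
import re

# Matches lines such as "#2 0xffffffff80e71b9d at do_userquota_update+0xad".
FRAME_RE = re.compile(
    r"#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+(\w+)\+(0x[0-9a-fA-F]+)"
)

def parse_frame(line):
    """Return a dict describing one backtrace frame, or None if no match."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    index, address, symbol, offset = match.groups()
    return {
        "index": int(index),
        "address": int(address, 16),
        "symbol": symbol,
        "offset": int(offset, 16),
    }

lines = [
    "#2 0xffffffff80e71b9d at do_userquota_update+0xad",
    "#6 0xffffffff80ea7ed3 at txg_sync_thread+0x253",
    "Uptime: 14h31m10s",  # not a frame, so it is skipped
]

frames = [f for f in map(parse_frame, lines) if f is not None]
for frame in frames:
    # Address minus offset gives the start of the named function.
    start = frame["address"] - frame["offset"]
    print(frame["index"], frame["symbol"], hex(start))
```

Subtracting the offset from the address gives each function's start address, which is handy for checking that two traces were taken from the same kernel build.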

Thanks for any help.
//Clay

Steven Hartland

2012-Jul-23 13:26 UTC

ZFS causing panic

----- Original Message ----- 
From: "Clayton Milos" <clay@milos.co.za>

> Hi guys
> 
> I've had an issue for some time now. When I'm copying a lot of files over to
> ZFS usually using SMB it causes a panic and locks up the server.
> I'm running FreeBSD 9.0-RELEASE with a custom kernel. I've just pulled
> unnecessary drivers out of the config and added:
> cpu             HAMMER
> device          pf
> device          pflog
> options         DEVICE_POLLING
> options         HZ=1000
> 
> For full disclosure I am getting these errors in the syslog which means
> there's an ECC error occurring somewhere which I am trying to locate. I have
> replaced both of the CPU's and all of the RAM and am still getting it so
> perhaps the north bridge has bought the farm.

We have some similar HW here and we suspect either the CPU or the Northbridge
too. We were seeing day-to-day panics, and a scrub would pretty much guarantee
a panic.

We also replaced the CPUs with no joy, but found that disabling the cores of
the CPU in the second socket made the issues go away, which strengthens the
Northbridge theory.

Try disabling the cores with the following and see if it helps:-
/boot/loader.conf
hint.lapic.1.disabled="1"
hint.lapic.2.disabled="1"

Unfortunately if you have perceived corruption due to this type of issue
there's no guaranteeing what state your data is really in :(

    Regards
    Steve

===============================================
This e.mail is private and
confidential between Multiplay (UK) Ltd. and the person or entity to whom it is
addressed. In the event of misdirection, the recipient is prohibited from using,
copying, printing or otherwise disseminating it or any information contained in
it.

In the event of misdirection, illegible or incomplete transmission please
telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.
