Hi guys,
I've had an issue for some time now. When I copy a lot of files over to
ZFS, usually over SMB, it causes a panic and locks up the server.
I'm running FreeBSD 9.0-RELEASE with a custom kernel. The only changes are
that I pulled unnecessary drivers out of the config and added:
cpu HAMMER
device pf
device pflog
options DEVICE_POLLING
options HZ=1000
For full disclosure, I am also getting errors in the syslog that point to
an ECC error occurring somewhere, which I am trying to locate. I have
replaced both CPUs and all of the RAM and am still getting them, so
perhaps the north bridge has bought the farm. I don't think this is the
cause, though, because I was getting panics before on other hardware. Current
hardware is an 80G OS drive, 2x Opteron 285s and 16G (8x2G) of RAM on a
Tyan 2892 motherboard. The RAID card is an Areca 1120.
I am running 2 pools. Each sits on a 4-drive hardware RAID5 array. The one I'm
having issues with is 4x3TB drives seen as a single 9TB SCSI drive:
da0 at arcmsr0 bus 0 scbus6 target 0 lun 0
da0: <Areca HOMER R001> Fixed Direct Access SCSI-5 device
da0: 166.666MB/s transfers (83.333MHz, offset 32, 16bit)
da0: Command Queueing enabled
da0: 8583068MB (17578123776 512 byte sectors: 255H 63S/T 1094187C)
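As a quick sanity check on those numbers, here is a small sketch of the arithmetic. It assumes the "MB" in the da0 line means 1024*1024 bytes and that the drives are 3TB in the decimal sense; the variable names are only for illustration.

```python
# Check that the da0 geometry matches 4x3TB in RAID5 (one drive's worth of parity).
SECTOR_BYTES = 512
sectors = 17578123776        # sector count from the da0 line
drives, drive_tb = 4, 3      # 4 x 3TB (decimal TB, an assumption)

size_bytes = sectors * SECTOR_BYTES
size_mib = size_bytes // (1024 * 1024)    # assumes "MB" here means 2**20 bytes
raid5_usable_tb = (drives - 1) * drive_tb

print("reported size: %d MB" % size_mib)
print("RAID5 usable:  %d TB (%.3f TB seen by the OS)"
      % (raid5_usable_tb, size_bytes / 1e12))
```

So the volume size the kernel sees is consistent with a 4-drive RAID5 of 3TB disks.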
This is encrypted with GELI to make /dev/da0.eli, upon which the pool is
created. It looks like the pool has been lost since the last panic:
  pool: homer
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from
        a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
  scan: scrub repaired 0 in 7h0m with 0 errors on Mon Jul 23 05:25:27 2012
config:
        NAME        STATE     READ WRITE CKSUM
        homer       FAULTED      0     0     2
          da0.eli   ONLINE       0     0     8
Also I was running a script to check the kernel memory every 2 seconds. It
appears that it was well within the 1G I have assigned it in
/boot/loader.conf:
TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695219900, 663.013 MB
TOTAL=695219900, 663.013 MB
TOTAL=695345852, 663.133 MB
TOTAL=695412412, 663.197 MB
TOTAL=695228092, 663.021 MB
TOTAL=695228092, 663.021 MB
TOTAL=695226044, 663.019 MB
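To put those samples in context, here is a rough sketch of the headroom calculation. The monitoring script itself is not shown here, so the formatting helper below is only an assumption about how its output lines are produced (bytes divided by 1024*1024, three decimals); the limit is the vm.kmem_size value of 1024M.

```python
# Compare the sampled kernel memory totals against the 1024M kmem limit.
MIB = 1024 * 1024
KMEM_LIMIT = 1024 * MIB      # vm.kmem_size="1024M"

def format_total(total_bytes):
    # Hypothetical helper: mimics the "TOTAL=..., ... MB" lines above.
    return "TOTAL=%d, %.3f MB" % (total_bytes, total_bytes / MIB)

# Distinct byte counts taken from the samples above.
samples = [695217852, 695219900, 695345852, 695412412, 695228092, 695226044]

peak = max(samples)
headroom_mb = (KMEM_LIMIT - peak) / MIB

print(format_total(peak))
print("headroom below the kmem limit: %.3f MB" % headroom_mb)
```

Going by these samples, the peak leaves roughly 360 MB free under the limit, so it does not look like kmem exhaustion.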
My /boot/loader.conf contains:
ng_bpf_load="YES"
amdtemp_load="YES"
ubsec_load="YES"
vm.kmem_size="1024M"
vm.kmem_size_max="1024M"
vfs.zfs.arc_max="600M"
vfs.zfs.vdev.cache.size="8M"
vfs.zfs.txg.timeout="5"
kern.maxvnodes="250000"
This system is a home server, so I can run a debug kernel if required and
crash it again.
My first question is: am I doing something wrong? I think the values I've put
in are sufficient, but I could well have gotten them wrong.
The server also doesn't appear to be writing the crash dump out. It hung at
1% and I had to power cycle it.
This is the panic:
panic: solaris assert: 0 == zap_increment_int(os, (-1ULL), user, delta, tx)
(0x0 == 0x7a), file:
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_object.c,
line: 1224
cpuid = 3
KDB: stack backtrace
#0 0xffffffff8055b74e at kdb_backtrace+0x5e
#1 0xffffffff80525c47 at panic+0x187
#2 0xffffffff80e71b9d at do_userquota_update+0xad
#3 0xffffffff80e71dae at dmu_objset_do_userquota_updates+0x1de
#4 0xffffffff80e882af at dsl_pool_sync+0x11f
#5 0xffffffff80e976e4 at spa_sync+0x334
#6 0xffffffff80ea7ed3 at txg_sync_thread+0x253
#7 0xffffffff804f89ee at fork_exit+0x11e
#8 0xffffffff8075847e at fork_trampoline+0xe
Uptime: 14h31m10s
Dumping 2489 out of 16370 MB:..1%
Thanks for any help.
//Clay