Hello
I have a few boxes (elfi, crus and gw-1) running gmirror. They are
three completely different machines, but all run FreeBSD 6.x (6.1 or
6.2, details below). Each has multiple gmirrored disks: two of the
boxes are SATA-only, and one mirrors a SATA disk with an ATA disk.
Every now and then each of them has problems with its mirror, and
all three fail in different ways. elfi crashes most often, but it is
also the box with the most disk I/O, so that may be "expected" (not
that it crashes, but that it is the one crashing most often).
First, some HW spec:
elfi:
FreeBSD elfi.stromnet.se 6.2-RELEASE FreeBSD 6.2-RELEASE #9: Thu Jan
18 16:53:20 CET 2007 root@:/usr/obj/usr/src/sys/ELFI i386
atapci1: <nVidia nForce3 Pro SATA150 controller> port
0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xdc00-0xdc0f,
0xe000-0xe07f irq 21 at device 10.0 on pci0
ad4: 286187MB <Maxtor 7L300S0 BANC1G10> at ata2-master SATA150
ad6: 286187MB <Maxtor 7L300S0 BANC1G10> at ata3-master SATA150
Mirror gm0s1 consists of ad4+ad6
crus:
FreeBSD crus.stromnet.org 6.1-RELEASE FreeBSD 6.1-RELEASE #3: Tue
May 9 20:40:23 CEST 2006 johan@elfi.stromnet.org:/usr/obj/usr/
src/sys/GENERIC i386
atapci1: <Promise PDC40518 SATA150 controller> port 0x7480-0x74ff,
0x7800-0x78ff mem 0xfebdb000-0xfebdbfff,0xfebe0000-0xfebfffff irq 22
at device 14.0 on pci1
ad8: 305245MB <Seagate ST3320620AS 3.AAE> at ata4-master SATA150
ad12: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA150
Mirror gm1 consists of ad8+ad12
gw-1:
FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7:
Tue Feb 13 18:24:34 CET 2007 johan@elfi.stromnet.se:/usr/obj/usr/
src/sys/ROUTER.POLLING i386
atapci0: <nVidia nForce2 Pro UDMA133 controller> port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 9.0 on pci0
atapci1: <nVidia nForce2 Pro SATA150 controller> port
0xec00-0xec07,0xe880-0xe883,0xe800-0xe807,0xe480-0xe483,0x7f00-0x7f0f,
0x7c00-0x7c7f irq 20 at device 11.
ad2: 38166MB <WDC WD400BB-00CAA1 17.07W17> at ata1-master UDMA100
ad6: 152627MB <SAMSUNG HD160JJ ZM100-41> at ata3-master SATA150
Mirror gm0 consists of ad6s1+ad2
A typical crash on elfi looks like this:
Apr 24 05:20:27 elfi kernel: ad6: FAILURE - device detached
Apr 24 05:20:27 elfi kernel: subdisk6: detached
Apr 24 05:20:27 elfi kernel: ad6: detached
Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6
disconnected.
Apr 24 05:20:27 elfi kernel: g_vfs_done():mirror/gm0s1f[READ
(offset=16972791808, length=16384)]error = 6
This can happen at any time of day; this one was at around 5 in the
morning. To recover I have to reboot the box (a soft reboot works),
and the mirror then rebuilds once it is up. Before the reboot,
atacontrol cannot find the disk at all; I have tried reinit and
detach/attach, but neither helps.
A crash on crus can look like this:
Apr 23 13:45:49 crus kernel: ad8: TIMEOUT - READ_DMA48 retrying (1
retry left) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - READ_DMA48 UDMA ICRC
error (retrying request) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER
MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER
MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE
taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE
taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SET_MULTI taskqueue
timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: FAILURE - READ_DMA48 timed out
LBA=566657039
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Request failed (error=5).
ad8[READ(offset=290128403968, length=16384)]
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8
disconnected.
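As a sanity check on the crus log above: the offset in the GEOM_MIRROR
"Request failed" line points at the same spot on disk as the LBA the
ATA driver timed out on. This is a minimal sketch of the arithmetic in
Python, assuming 512-byte sectors and that gm1 sits directly on the
whole disk ad8 (so no partition offset has to be added). The helper
name lba_to_offset is made up for illustration.

```python
SECTOR_SIZE = 512  # assumption: 512-byte sectors


def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset on the provider that corresponds to a given LBA."""
    return lba * sector_size


lba = 566657039              # from the ad8 READ_DMA48 lines
geom_offset = 290128403968   # from the GEOM_MIRROR "Request failed" line

# True when the mirror-level error is for the sector the ATA driver gave up on.
print(lba_to_offset(lba) == geom_offset)
```

So at least on crus, the mirror is reporting the very read that the
disk/controller timed out on, rather than some unrelated request.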
crus can be recovered with a gmirror forget followed by a gmirror
insert, after which it happily rebuilds the array.
The worst box is gw-1:
Apr 20 03:10:59 gw-1 kernel: ad2: timeout waiting to issue command
Apr 20 03:10:59 gw-1 kernel: ad2: error issuing WRITE_DMA command
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Request failed (error=5).
ad2[WRITE(offset=37578448384, length=16384)]
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Device gm0: provider ad2
disconnected.
Apr 20 07:23:57 gw-1 syslogd: kernel boot file is /boot/kernel/kernel
Apr 20 07:23:57 gw-1 kernel: Copyright (c) 1992-2007 The FreeBSD
Project.
Yes, the provider fails and then the whole box HANGS completely. No
input is possible at all, and I had to hard-reboot it with the
button. Not good at all. The disks that are now in elfi used to run
in this machine, and back then I had the same problem: disk problems
-> total hang. That was with SATA disks only; this time the failing
disk (ad2) is the ATA one, so it appears to affect ATA too?
I have never managed to trigger these crashes on demand. They happen
now and then, at no regular interval. For elfi:
Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6
disconnected.
(I actually can't find any other entries in the logs, but judging
from IRC logs it also happened on March 28, March 12, Feb 13, Jan 22
and Jan 18.)
For crus:
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8
disconnected.
Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8
disconnected.
I think it has happened once more, but that's it.
For gw-1 it has luckily happened only once so far, at least with the
current install. It also had problems when the Maxtor disks were
running in it (I think that was on 6.0).
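To compare the failure history across the three boxes, the disconnect
events can be pulled out of the syslog files and grouped per host,
mirror and provider. The following is only a rough sketch in Python.
It assumes the messages appear unwrapped, one per line, in the real
log files (they are only wrapped by the mail client here), and the
names DISCONNECT_RE and find_disconnects are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected."
DISCONNECT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: "
    r"GEOM_MIRROR: Device (?P<mirror>\S+): provider (?P<provider>\S+) "
    r"disconnected\."
)


def find_disconnects(lines):
    """Return {(host, mirror, provider): [timestamps]} for disconnect events."""
    events = defaultdict(list)
    for line in lines:
        m = DISCONNECT_RE.match(line)
        if m:
            key = (m["host"], m["mirror"], m["provider"])
            events[key].append(m["stamp"])
    return dict(events)


# Sample input: the disconnect lines quoted in this mail, joined onto one line
# each, plus one unrelated kernel line that must be ignored.
sample = [
    "Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
    "Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
    "Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.",
    "Apr 24 05:20:27 elfi kernel: ad6: detached",
]

for key, stamps in find_disconnects(sample).items():
    print(key, stamps)
```

Feeding it the contents of each box's log files (for example by
iterating over an opened file) would give one list of timestamps per
mirror member, which makes it easier to see whether it is always the
same disk that drops out.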
So: three different boxes with three different chipsets and three
different crash scenarios, but they all have problems. Where is the
actual problem? The hardware? The chipset drivers? The gmirror code?
I have run SMART tests on the failing disks and found no errors. A
while back I ran PowerMax (Maxtor's own test program) on the Maxtor
disks, with no problems. I have tried changing the SATA cables on
some of the disks, with no difference.
Does anyone have any clue about what can be causing this? What is
most likely? How do we hunt this down?
Thank you.
Johan Ström
Stromnet
johan@stromnet.se
http://www.stromnet.se/