Hi,
This is strange. gmirror just detached one of its disks
for no apparent reason. I've built a mirror consisting of
the components ad0 and ad1 (both SATA drives). It has
been running fine. This is RELENG_6 from 2006-12-20.
Yesterday evening ad1 was detached. There is no other
error message logged on console or in the logs (i.e. no
I/O error such as a bad sector or anything). There was
no particularly high load at that time. In fact, the
machine had been under much higher load before, without
anything bad happening.
This is from the logs:
Jan 29 19:10:13 pluto -- MARK --
Jan 29 19:20:26 pluto kernel: ad1: FAILURE - device detached
Jan 29 19:20:26 pluto kernel: subdisk1: detached
Jan 29 19:20:26 pluto kernel: ad1: detached
Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot write metadata on ad1 (device=gm0, error=6).
Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot update metadata on disk ad1 (error=6).
Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot update metadata on disk ad1 (error=6).
Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Device gm0: provider ad1 disconnected.
Jan 29 19:50:13 pluto -- MARK --
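For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch that picks out the ATA detach and the gmirror disconnect from syslog-style lines. The message formats are taken from the excerpt above; the function name and the sample data are only illustrative, and other ata/gmirror messages are not handled.

```python
import re

# Syslog-style kernel line, e.g.
#   Jan 29 19:20:26 pluto kernel: ad1: FAILURE - device detached
KERNEL_LINE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+(?P<msg>.*)$"
)
# Message shapes copied from the log excerpt in this mail (assumed to be stable).
ATA_DETACH = re.compile(r"^(?P<dev>ad\d+): FAILURE - device detached")
MIRROR_DROP = re.compile(
    r"^GEOM_MIRROR: Device (?P<mirror>\S+): provider (?P<dev>\S+) disconnected"
)


def find_detach_events(lines):
    """Return (timestamp, kind, device) tuples for detach-related kernel lines."""
    events = []
    for line in lines:
        m = KERNEL_LINE.match(line.strip())
        if not m:
            continue  # e.g. "-- MARK --" lines or non-kernel messages
        msg = m.group("msg")
        ata = ATA_DETACH.match(msg)
        if ata:
            events.append((m.group("stamp"), "ata-detach", ata.group("dev")))
            continue
        drop = MIRROR_DROP.match(msg)
        if drop:
            events.append((m.group("stamp"), "mirror-disconnect", drop.group("dev")))
    return events


# Sample input: the lines from the log excerpt, with wrapped lines rejoined.
SAMPLE_LOG = """\
Jan 29 19:10:13 pluto -- MARK --
Jan 29 19:20:26 pluto kernel: ad1: FAILURE - device detached
Jan 29 19:20:26 pluto kernel: subdisk1: detached
Jan 29 19:20:26 pluto kernel: ad1: detached
Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot write metadata on ad1 (device=gm0, error=6).
Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Device gm0: provider ad1 disconnected.
Jan 29 19:50:13 pluto -- MARK --
"""

if __name__ == "__main__":
    for stamp, kind, dev in find_detach_events(SAMPLE_LOG.splitlines()):
        print(stamp, kind, dev)
```

Pointing it at /var/log/messages instead of the sample would be a matter of passing the open file object to find_detach_events().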
This almost looks like a typical Windows problem: Something
reports a "failure", but gives no reason or any other useful
information. :-(
"atacontrol list" reports for ad1:
Master: no device present
After an atacontrol detach/attach cycle, the device is back
again:
Master: ad1 <SAMSUNG HD160JJ/WU100-41> Serial ATA II
I inserted it back into the gmirror, and right now it's
synchronizing happily.
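For reference, the recovery above can be written down as a small Python dry-run that only prints the commands rather than running them. This is a sketch under assumptions: the channel name ata3 is inferred from the dmesg output (ad1 at ata3-master), the helper name is made up, and whether "gmirror forget" is needed before "gmirror insert" depends on the state the mirror is in, so check gmirror(8) and atacontrol(8) on your release before running anything.

```python
def recovery_commands(channel="ata3", mirror="gm0", disk="ad1"):
    """Build the command sequence for re-attaching a dropped mirror component.

    Hypothetical helper: it only returns argument lists and never executes them.
    """
    return [
        # Cycle the ATA channel so the kernel re-probes the drive.
        ["atacontrol", "detach", channel],
        ["atacontrol", "attach", channel],
        ["atacontrol", "list"],
        # Assumption: drop the stale component record, then re-add the disk.
        ["gmirror", "forget", mirror],
        ["gmirror", "insert", mirror, disk],
        # Watch the resynchronization progress.
        ["gmirror", "status", mirror],
    ]


if __name__ == "__main__":
    for cmd in recovery_commands():
        print(" ".join(cmd))
```

Printing first and pasting the commands by hand is deliberate: a detach on the wrong channel would take the remaining good disk offline.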
Can anybody please explain what happened, and -- more
importantly -- how to avoid it in the future? As far as
I can tell, the disk drives are perfectly OK.
Best regards
Oliver
PS: disk-related stuff from dmesg:
atapci0: <VIA 6420 SATA150 controller> port 0xe100-0xe107,0xe200-0xe203,0xe300-0xe307,0xe400-0xe403,0xe500-0xe50f,0xe600-0xe6ff irq 20 at device 15.0 on pci0
ata2: <ATA channel 0> on atapci0
ata3: <ATA channel 1> on atapci0
atapci1: <VIA 8237 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xe700-0xe70f at device 15.1 on pci0
ata0: <ATA channel 0> on atapci1
ata1: <ATA channel 1> on atapci1
ad0: 152627MB <SAMSUNG HD160JJ WU100-41> at ata2-master SATA150
ad1: 152627MB <SAMSUNG HD160JJ WU100-41> at ata3-master SATA150
The PATA controller (ata[01] on atapci1) is not used.
I have disabled ATA_STATIC_ID, so the disks are named
ad0 and ad1. I also have atapicam in the kernel, but
it's not actually used and shouldn't make a difference.
This is the SATA-related info from pciconf -lv:
atapci0@pci0:15:0: class=0x010400 card=0x70941462 chip=0x31491106 rev=0x80 hdr=0x00
    vendor   = 'VIA Technologies Inc'
    device   = 'VT8237 VT6410 SATA RAID Controller'
    class    = mass storage
    subclass = RAID
PPS: By the way, what's the best mailing list for ata-
related problems? There's no freebsd-ata@freebsd.org ...
--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, USt-Id: DE204219783
Any opinions expressed in this message are personal to the author and may
not necessarily reflect the opinions of secnetix GmbH & Co KG in any way.
FreeBSD services, products and more: http://www.secnetix.de/bsd
"C++ is the only current language making COBOL look good."
-- Bertrand Meyer

Hi,

At 08:54 30/01/2007, Oliver Fromme wrote:
>Hi,
>
>This is strange. gmirror just detached one of its disks
>for no apparent reason. [etc]

I've seen similar symptoms on different hardware with ataraid. I suspect
that SATA disks occasionally fail to make their bus timings and (some?)
controllers are completely intolerant of that.

--
Bob Bishop          +44 (0)118 940 1243
rb@gid.co.uk    fax +44 (0)118 940 1295

On Jan 30, 2007, at 9:54, Oliver Fromme wrote:
> Hi,
>
> This is strange. gmirror just detached one of its disks
> for no apparent reason. I've built a mirror consisting of
> the components ad0 and ad1 (both SATA drives). It has
> been running fine. This is RELENG_6 from 2006-12-20.
>
> Yesterday evening ad1 was detached. There is no other
> error message logged on console or in the logs (i.e. no
> I/O error such as a bad sector or anything). There was
> no particularly high load at that time. In fact, the
> machine had been under much higher load before, without
> anything bad happening.

I had unexplainable intermittent detaches until I replaced one of my
memory modules. Never happened since.
Admittedly I also had problems completing buildworld - that's why I
checked my memory modules in the first place.

--
Alban Hertroys
"This person has performed an illegal operation, and will be shot down."

On Tuesday, 30 January 2007 09:54, Oliver Fromme wrote:
> Hi,
>
> This is strange. gmirror just detached one of its disks
> for no apparent reason. I've built a mirror consisting of
> the components ad0 and ad1 (both SATA drives). It has
> been running fine. This is RELENG_6 from 2006-12-20.
>
> Yesterday evening ad1 was detached. There is no other
> error message logged on console or in the logs (i.e. no
> I/O error such as a bad sector or anything). There was
> no particularly high load at that time. In fact, the
> machine had been under much higher load before, without
> anything bad happening.
>
> This is from the logs:
>
> Jan 29 19:10:13 pluto -- MARK --
> Jan 29 19:20:26 pluto kernel: ad1: FAILURE - device detached
> Jan 29 19:20:26 pluto kernel: subdisk1: detached
> Jan 29 19:20:26 pluto kernel: ad1: detached
> Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot write metadata on ad1 (device=gm0, error=6).
> Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot update metadata on disk ad1 (error=6).
> Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot update metadata on disk ad1 (error=6).
> Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Device gm0: provider ad1 disconnected.
> Jan 29 19:50:13 pluto -- MARK --
>
> This almost looks like typical Windows problems: Something
> reports a "failure", but no reason or any other useful
> information. :-(
>
> "atacontrol list" reports for ad1:
>
> Master: no device present
>
> After an atacontrol detach/attach cycle, the device is back
> again:
>
> Master: ad1 <SAMSUNG HD160JJ/WU100-41> Serial ATA II
>
> I inserted it back into the gmirror, and right now it's
> synchronizing happily.
>
> Can anybody please explain what happened, and -- more
> importantly -- how to avoid it in the future? As far as
> I can tell, the disk drives are perfectly OK.

I think this happens when the drive's internal thermal recalibration
takes too long. Consumer HDDs can be "offline" for quite some time. I don't
have numbers handy, but see Western Digital's explanation of their SATA RE
(RAID Edition) drives. Again, no link handy, sorry.

-Harry