Johan Ström
2007-Aug-21 05:32 UTC
Crashed gmirror, single disk marked SYNC and won't boot...
Hi
FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7:
Tue Feb 13 18:24:34 CET 2007
johan@elfi.stromnet.se:/usr/obj/usr/src/sys/ROUTER.POLLING i386
(ROUTER.POLLING is GENERIC + options DEVICE_POLLING and ALTQ,
IPSEC, also pfsync and carp)
This weekend I had a disk failing on me in a machine running gmirror
gm0 with 2 providers (ad0 and ad6). The whole box froze with no
screen output, and on hard reboot I got some LBA errors etc. from ad0;
after a few reboots it got up and running though (I wasn't at the
screen and had to do it by phone, so I couldn't really debug very
well).
As soon as the box got up, I removed ad0 from the gmirror, so ad6 was
the only provider. Today I got a new disk to replace ad0.
Now remember, ad6 was the only disk in the mirror. I took the box down
cleanly and replaced the disk. ad0 was now gone and instead I had ad4
(ad4 and ad6 are SATA, ad0 was IDE). I changed the setup so that it
booted off the old SATA disk.
Okay, there came the first problem: the boot loader gave me the usual
options, F1 FreeBSD, F5 Disk 2 (or whatever it said). If I pressed F1
I got the same prompt again; F5 did nothing at all. Funny! The
system refused to load the loader (or whatever the 1-9 menu thingy is
called), the kernel, or anything.
So I finally plugged the old ad0 disk into the machine to at least
get it booted, thinking it would come up on the gmirror. Nope:
(the new ad4 was taken out here)
ad0: 38166MB <WDC WD400BB-00CAA1 17.07W17> at ata0-master UDMA100
ad6: 152627MB <SAMSUNG HD160JJ ZM100-41> at ata3-master SATA150
GEOM_MIRROR: Device gm0 created (id=4029378995).
GEOM_MIRROR: Device gm0: provider ad6 detected.
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
GEOM_MIRROR: Force device gm0 start due to timeout.
Trying to mount root from ufs:/dev/mirror/gm0s1a
Manual root filesystem specification:
  <fstype>:<device>  Mount <device> using filesystem <fstype>
                       eg. ufs:da0s1a
  ?                  List valid disk boot devices
  <empty line>       Abort manual input

mountroot>
Okay... so why wouldn't it load my mirror from ad6 now? I had just
done a clean shutdown without problems. It didn't even recognize any
slices on ad6s1 (although ad6s1 itself was found)...
I entered ad0s1 as root and booted from there. Of course I ended up
in the emergency shell, since fstab looked for the gmirror devices,
which didn't exist.
After some more digging into gmirror, I ran gmirror dump ad6:
Metadata on /dev/ad6:
magic: GEOM::MIRROR
version: 3
name: gm0
mid: 4029378995
did: 449032193
all: 3
genid: 0
syncid: 5
priority: 0
slice: 4096
balance: round-robin
mediasize: 20416757248
sectorsize: 512
syncoffset: 0
mflags: NONE
dflags: SYNCHRONIZING
hcprovider:
provsize: 160041885696
MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f
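The dump is a plain list of "key: value" lines, so the fields that matter for this problem (all, dflags, mediasize, provsize) can be pulled out mechanically. A minimal Python sketch, assuming the output format shown above; parse_dump is a hypothetical helper and not part of gmirror:

```python
# Text copied from the `gmirror dump ad6` output above (header line omitted).
DUMP = """\
magic: GEOM::MIRROR
version: 3
name: gm0
mid: 4029378995
did: 449032193
all: 3
genid: 0
syncid: 5
priority: 0
slice: 4096
balance: round-robin
mediasize: 20416757248
sectorsize: 512
syncoffset: 0
mflags: NONE
dflags: SYNCHRONIZING
hcprovider:
provsize: 160041885696
MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f
"""


def parse_dump(text):
    """Hypothetical helper: turn 'key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so "GEOM::MIRROR" stays intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


fields = parse_dump(DUMP)
print("components expected (all):", int(fields["all"]))
print("disk flags (dflags):", fields["dflags"])
print("mirror size in bytes:", int(fields["mediasize"]))
print("provider size in bytes:", int(fields["provsize"]))
```

The sketch only reads text that has already been captured; it does not touch the disk or call gmirror itself.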
Some googling indicated that SYNCHRONIZING means that it's not
"complete" and won't mount? Is that correct? Why would it be in that
state, then? I just shut it down fine... And where the f*ck did my
slices go?
I did a sysctl kern.geom.mirror.debug=2 and tried to gmirror activate
the mirror:
GEOM_MIRROR[1]: Creating device gm0 (id=4029378995).
GEOM_MIRROR[0]: Device gm0 created (id=4029378995).
GEOM_MIRROR[1]: root_mount_hold 0xc3539510
GEOM_MIRROR[1]: Adding disk ad6 to gm0.
GEOM_MIRROR[2]: Adding disk ad6.
GEOM_MIRROR[2]: Disk ad6 connected.
GEOM_MIRROR[1]: Disk ad6 state changed from NONE to NEW (device gm0).
GEOM_MIRROR[0]: Device gm0: provider ad6 detected.
GEOM_MIRROR[2]: Tasting ad6s1.
GEOM_MIRROR[0]: Force device gm0 start due to timeout.
GEOM_MIRROR[1]: root_mount_rel[2169] 0xc3539510
GEOM_MIRROR[2]: No I/O requests for gm0, it can be destroyed.
GEOM_MIRROR[2]: Metadata on ad6 updated.
GEOM_MIRROR[2]: Access ad6 r-1w-1e-1 = 0
GEOM_MIRROR[0]: Device gm0 destroyed.
GEOM_MIRROR[1]: Thread exiting.
GEOM_MIRROR[1]: Consumer ad6 destroyed.
So... what is going on here? Anyone with some clues? I'm currently
running on the ad0 disk, with no RAID at all. Let's hope it doesn't
die on me (I haven't had any signs of that since Sunday, when it froze
and gave boot errors, so I'm hoping...). The data loss from using ad0
instead of ad6 is probably minimal; it's a router, so more or less
only logging seems to have been lost. For now I just want to get
clear about wth happened here, how to prevent it, and how to get
back up on a gmirror with ad6 and ad4 (to be plugged in) so I can
throw ad0 out.
Thanks
--
Johan Ström
Stromnet
johan@stromnet.se
http://www.stromnet.se/
Pawel Jakub Dawidek
2007-Aug-21 08:00 UTC
Crashed gmirror, single disk marked SYNC and won't boot...
On Tue, Aug 21, 2007 at 02:15:08PM +0200, Johan Ström wrote:
> Hi
>
> FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7:
> Tue Feb 13 18:24:34 CET 2007
> johan@elfi.stromnet.se:/usr/obj/usr/src/sys/ROUTER.POLLING i386
>
> (ROUTER.POLLING is GENERIC + options DEVICE_POLLING and ALTQ,
> IPSEC, also pfsync and carp)
>
> This weekend I had a disk failing on me in a machine running gmirror
> gm0 with 2 providers (ad0 and ad6). The whole box froze with no
> screen output, and on hard reboot I got some LBA errors etc. from ad0;
> after a few reboots it got up and running though (I wasn't at the
> screen and had to do it by phone, so I couldn't really debug very
> well).
> As soon as the box got up, I removed ad0 from the gmirror, so ad6 was
> the only provider. Today I got a new disk to replace ad0.
> Now remember, ad6 was the only disk in the mirror. I took the box down
> cleanly and replaced the disk. ad0 was now gone and instead I had ad4
> (ad4 and ad6 are SATA, ad0 was IDE). I changed the setup so that it
> booted off the old SATA disk.
> Okay, there came the first problem: the boot loader gave me the usual
> options, F1 FreeBSD, F5 Disk 2 (or whatever it said). If I pressed F1
> I got the same prompt again; F5 did nothing at all. Funny! The
> system refused to load the loader (or whatever the 1-9 menu thingy is
> called), the kernel, or anything.
> So I finally plugged the old ad0 disk into the machine to at least
> get it booted, thinking it would come up on the gmirror. Nope:
>
> (the new ad4 was taken out here)
> ad0: 38166MB <WDC WD400BB-00CAA1 17.07W17> at ata0-master UDMA100
> ad6: 152627MB <SAMSUNG HD160JJ ZM100-41> at ata3-master SATA150
> GEOM_MIRROR: Device gm0 created (id=4029378995).
> GEOM_MIRROR: Device gm0: provider ad6 detected.
> Root mount waiting for: GMIRROR
> Root mount waiting for: GMIRROR
> Root mount waiting for: GMIRROR
> Root mount waiting for: GMIRROR
> GEOM_MIRROR: Force device gm0 start due to timeout.
> Trying to mount root from ufs:/dev/mirror/gm0s1a
>
> Manual root filesystem specification:
>   <fstype>:<device>  Mount <device> using filesystem <fstype>
>                        eg. ufs:da0s1a
>   ?                  List valid disk boot devices
>   <empty line>       Abort manual input
>
> mountroot>
>
> Okay... so why wouldn't it load my mirror from ad6 now? I had just
> done a clean shutdown without problems. It didn't even recognize any
> slices on ad6s1 (although ad6s1 itself was found)...

It loaded your mirror just fine; you are confusing things. Gmirror
started in degraded state, as one would expect, but it seems there is
no 'a' partition on your gm0s1 slice (or the entire bsdlabel is gone).
You could try to recreate it based on the bsdlabel from ad0 (if it
should be the same), but I have no idea how it disappeared. Anyway,
gmirror seems to work properly.

> After some more digging into gmirror, I ran gmirror dump ad6:
>
> Metadata on /dev/ad6:
> magic: GEOM::MIRROR
> version: 3
> name: gm0
> mid: 4029378995
> did: 449032193
> all: 3

You have a 3-way mirror?

> genid: 0
> syncid: 5
> priority: 0
> slice: 4096
> balance: round-robin
> mediasize: 20416757248
> sectorsize: 512
> syncoffset: 0
> mflags: NONE
> dflags: SYNCHRONIZING
> hcprovider:
> provsize: 160041885696
> MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f

BTW, your provider size is 149GB and the mirror only uses 19GB, which
means you mirrored a 149GB disk with a 19GB disk and you waste 130GB
(it's unused).

> Some googling indicated that SYNCHRONIZING means that it's not
> "complete" and won't mount? Is that correct? Why would it be in that
> state, then? I just shut it down fine... And where the f*ck did my
> slices go?

SYNCHRONIZING means that this component was/is being synchronized.
It seems that you removed/lost the master disk while it was
synchronizing. It should work anyway.

BTW, you are confusing things again. Your slice is just fine (ad6s1);
you don't have partitions, AFAIU.

All in all, your partition table seems to be gone. If you created it
on gmirror before (gm0s1), you may still have the same partition table
on the other half of the mirror. You can try to move it to ad6 with
bsdlabel and verify whether you can see the file systems inside the
partitions.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
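
The size figures in the reply can be checked from the two byte counts in the gmirror dump output. A minimal Python sketch of that arithmetic, assuming the reply's "GB" means binary gigabytes (2**30 bytes); the constant names are only illustrative:

```python
# Byte counts copied from the `gmirror dump ad6` output in the thread.
MEDIASIZE = 20416757248    # "mediasize": size of the mirror device gm0
PROVSIZE = 160041885696    # "provsize": size of the provider ad6

GIB = 2 ** 30  # assumption: "GB" in the reply means 2**30 bytes
MIB = 2 ** 20

# Space on ad6 that the mirror does not cover.
unused = PROVSIZE - MEDIASIZE

print(f"mirror:   {MEDIASIZE / GIB:.2f} GiB")
print(f"provider: {PROVSIZE / GIB:.2f} GiB ({PROVSIZE // MIB} MiB)")
print(f"unused:   {unused / GIB:.2f} GiB")
```

The whole-number parts come out as 19, 149 and 130, matching the figures in the reply, and the provider size in MiB agrees with the 152627MB that the kernel reported for ad6 at boot.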