Johan Ström
2007-Aug-21 05:32 UTC
Crashed gmirror, single disk marked SYNC and won't boot...
Hi,

FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7: Tue Feb 13 18:24:34 CET 2007 johan@elfi.stromnet.se:/usr/obj/usr/src/sys/ROUTER.POLLING i386

(ROUTER.POLLING is GENERIC + options DEVICE_POLLING and ALTQ, IPSEC, also pfsync and carp.)

This weekend a disk failed on me in a machine running gmirror gm0 with 2 providers (ad0 and ad6). The whole box froze with no screen output, and on hard reboot I got some LBA errors etc. from ad0. After a few reboots it came up and was running though (I wasn't at the screen and had to do it by phone, so I couldn't really debug very well). As soon as the box was up, I removed ad0 from the gmirror, so ad6 was the only provider.

Today I got a new disk to replace ad0. Now remember, ad6 was the only disk in the mirror. I took the box down cleanly and replaced the disk. ad0 was now gone and instead I had ad4 (ad4 and ad6 are SATA, ad0 was IDE). I changed things so it booted off the old SATA disk (ad6).

Okay, there came the first problem: the boot loader gave me the usual options, F1 FreeBSD, F5 Disk 2 (or whatever it said). If I pressed F1 I got the same prompt again; F5 did nothing at all. Funny! The system refused to load the loader (or whatever the 1-9 menu thingy is called), the kernel or anything.

So I finally plugged the old ad0 disk back into the machine to at least get it booted, thinking it would come up on the gmirror. Nope (I had taken the new ad4 out at this point):

    ad0: 38166MB <WDC WD400BB-00CAA1 17.07W17> at ata0-master UDMA100
    ad6: 152627MB <SAMSUNG HD160JJ ZM100-41> at ata3-master SATA150
    GEOM_MIRROR: Device gm0 created (id=4029378995).
    GEOM_MIRROR: Device gm0: provider ad6 detected.
    Root mount waiting for: GMIRROR
    Root mount waiting for: GMIRROR
    Root mount waiting for: GMIRROR
    Root mount waiting for: GMIRROR
    GEOM_MIRROR: Force device gm0 start due to timeout.
    Trying to mount root from ufs:/dev/mirror/gm0s1a

    Manual root filesystem specification:
      <fstype>:<device>  Mount <device> using filesystem <fstype>
                           eg. ufs:da0s1a
      ?                  List valid disk boot devices
      <empty line>       Abort manual input

    mountroot>

Okay... so why wouldn't it load my mirror from ad6 now? I had just done a clean shutdown without problems. It didn't even recognize any slices on ad6s1 (although ad6s1 itself was found). I entered ad0s1 as root and booted from there; of course I ended up in the emergency shell, since fstab looked for the gmirror devices, which didn't exist.

Digging some more into gmirror, I did a gmirror dump ad6:

    Metadata on /dev/ad6:
    magic: GEOM::MIRROR
    version: 3
    name: gm0
    mid: 4029378995
    did: 449032193
    all: 3
    genid: 0
    syncid: 5
    priority: 0
    slice: 4096
    balance: round-robin
    mediasize: 20416757248
    sectorsize: 512
    syncoffset: 0
    mflags: NONE
    dflags: SYNCHRONIZING
    hcprovider:
    provsize: 160041885696
    MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f

Some googling indicated that SYNCHRONIZING means the component is not "complete" and won't mount? Is that correct? Why would it be in that state, when I had just shut it down cleanly? And where the f*ck did my slices go?

I did a sysctl kern.geom.mirror.debug=2 and tried to gmirror activate the mirror:

    GEOM_MIRROR[1]: Creating device gm0 (id=4029378995).
    GEOM_MIRROR[0]: Device gm0 created (id=4029378995).
    GEOM_MIRROR[1]: root_mount_hold 0xc3539510
    GEOM_MIRROR[1]: Adding disk ad6 to gm0.
    GEOM_MIRROR[2]: Adding disk ad6.
    GEOM_MIRROR[2]: Disk ad6 connected.
    GEOM_MIRROR[1]: Disk ad6 state changed from NONE to NEW (device gm0).
    GEOM_MIRROR[0]: Device gm0: provider ad6 detected.
    GEOM_MIRROR[2]: Tasting ad6s1.
    GEOM_MIRROR[0]: Force device gm0 start due to timeout.
    GEOM_MIRROR[1]: root_mount_rel[2169] 0xc3539510
    GEOM_MIRROR[2]: No I/O requests for gm0, it can be destroyed.
    GEOM_MIRROR[2]: Metadata on ad6 updated.
    GEOM_MIRROR[2]: Access ad6 r-1w-1e-1 = 0
    GEOM_MIRROR[0]: Device gm0 destroyed.
    GEOM_MIRROR[1]: Thread exiting.
    GEOM_MIRROR[1]: Consumer ad6 destroyed.

So... what is going on here? Anyone with some clues? I'm currently running on the ad0 disk, no RAID at all.
Let's hope it doesn't die on me (I haven't seen any signs of that since Sunday, when it froze and gave boot errors, so I'm hoping...). The data loss from using ad0 instead of ad6 is probably minimal; it's a router, so it's more or less only logging that seems to have been lost.

For now I just want to get clear about what happened here, how to prevent it, and how to get back onto a gmirror with ad6 and ad4 (to be plugged in) so I can throw ad0 out.

Thanks

--
Johan Ström
Stromnet
johan@stromnet.se
http://www.stromnet.se/
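The `gmirror dump` output above is just a list of `key: value` lines, so the fields that matter for this problem (dflags and the two sizes) can be pulled out mechanically. A minimal Python sketch, assuming the dump has been captured as text; the helper names `parse_gmirror_dump` and `summarize` are hypothetical, and sizes are converted with 1 GiB = 1024^3 bytes:

```python
# Parse the "key: value" lines printed by `gmirror dump <provider>` and
# summarize the fields discussed in this thread. Helper names are hypothetical.

GIB = 1024 ** 3  # bytes per GiB

# The dump of ad6 quoted in the message above.
DUMP = """\
Metadata on /dev/ad6:
magic: GEOM::MIRROR
version: 3
name: gm0
mid: 4029378995
did: 449032193
all: 3
genid: 0
syncid: 5
priority: 0
slice: 4096
balance: round-robin
mediasize: 20416757248
sectorsize: 512
syncoffset: 0
mflags: NONE
dflags: SYNCHRONIZING
hcprovider:
provsize: 160041885696
MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f
"""


def parse_gmirror_dump(text: str) -> dict:
    """Return the metadata fields as a dict; the 'Metadata on ...' header is skipped."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Metadata on"):
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields: dict) -> dict:
    """Flag SYNCHRONIZING and compare the mirror size with the component size."""
    media = int(fields["mediasize"])  # size of the mirror device (gm0), bytes
    prov = int(fields["provsize"])    # size of this component (ad6), bytes
    return {
        "synchronizing": "SYNCHRONIZING" in fields.get("dflags", ""),
        "mirror_gib": round(media / GIB, 2),
        "provider_gib": round(prov / GIB, 2),
        "unused_gib": round((prov - media) / GIB, 2),
    }


if __name__ == "__main__":
    for key, value in summarize(parse_gmirror_dump(DUMP)).items():
        print(f"{key}: {value}")
```

For this dump the summary flags SYNCHRONIZING and shows a mirror of about 19.01 GiB sitting on a component of about 149.05 GiB, so roughly 130 GiB of ad6 lies outside gm0.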
Pawel Jakub Dawidek
2007-Aug-21 08:00 UTC
Crashed gmirror, single disk marked SYNC and won't boot...
On Tue, Aug 21, 2007 at 02:15:08PM +0200, Johan Ström wrote:
> [...]
> GEOM_MIRROR: Device gm0 created (id=4029378995).
> GEOM_MIRROR: Device gm0: provider ad6 detected.
> Root mount waiting for: GMIRROR
> Root mount waiting for: GMIRROR
> Root mount waiting for: GMIRROR
> Root mount waiting for: GMIRROR
> GEOM_MIRROR: Force device gm0 start due to timeout.
> Trying to mount root from ufs:/dev/mirror/gm0s1a
>
> Manual root filesystem specification:
>   <fstype>:<device>  Mount <device> using filesystem <fstype>
>                        eg. ufs:da0s1a
>   ?                  List valid disk boot devices
>   <empty line>       Abort manual input
>
> mountroot>
>
> Okey... so why wouldnt it load my mirror from ad6 now?? I just did a
> clean shutdown without problems.. It didnt even recognize any slices
> on ad6s1 (altough the ad6s1 was found)...

It loaded your mirror just fine; you are confusing things. Gmirror started in a degraded state, as one could expect, but it seems there is no 'a' partition on your gm0s1 slice (or the entire bsdlabel is gone). You could try to recreate it based on the bsdlabel from ad0 (if it should be the same), but I have no idea how it disappeared. Anyway, gmirror seems to work properly.

> Some more digging into gmirror, I did a gmirror dump ad6:
>
> Metadata on /dev/ad6:
> magic: GEOM::MIRROR
> version: 3
> name: gm0
> mid: 4029378995
> did: 449032193
> all: 3

You have a 3-way mirror?

> genid: 0
> syncid: 5
> priority: 0
> slice: 4096
> balance: round-robin
> mediasize: 20416757248
> sectorsize: 512
> syncoffset: 0
> mflags: NONE
> dflags: SYNCHRONIZING
> hcprovider:
> provsize: 160041885696
> MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f

BTW. Your provider size is 149GB and the mirror only uses 19GB, which means you mirrored a 149GB disk with a 19GB disk and you waste 130GB (it's unused).

> Some googling indicated that SYNCHRONIZING means that its not
> "complete" and wont mount? Is that correct? Why would it be in that
> state then, I just shut it down fine... And where the f*ck did my
> slices go??..

SYNCHRONIZING means that this component was/is being synchronized. It seems that you removed/lost the master disk while it was synchronizing. It should work anyway.

BTW. You are confusing things again. Your slice (ad6s1) is just fine; it's the partitions you don't have, AFAIU. All in all, your partition table seems to be gone.
If you created it on gmirror before (gm0s1), you may still have the same partition table on the other half of the mirror. You can try to copy it to ad6 with bsdlabel and verify whether you can see the file systems inside the partitions.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
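Pawel's diagnosis is that gm0 itself came up, but the label inside gm0s1 has no 'a' partition, so the kernel had nothing to mount as ufs:/dev/mirror/gm0s1a. As a rough illustration, the Python sketch below checks `bsdlabel` text output for the partition letter named in a mountroot spec. It assumes the usual bsdlabel listing layout (one `x: size offset fstype ...` line per partition); the helper names and the sample label are hypothetical, not taken from this thread:

```python
import re

# One partition line in `bsdlabel <slice>` output, e.g. "  a:  1048576  16  4.2BSD ..."
# Assumed layout: letter, size, offset, fstype, then optional fields.
PART_RE = re.compile(r"^\s*([a-h]):\s+(\d+)\s+(\d+)\s+(\S+)")


def partitions(label_text: str) -> dict:
    """Map partition letter -> (size, offset, fstype) parsed from bsdlabel output."""
    parts = {}
    for line in label_text.splitlines():
        m = PART_RE.match(line)
        if m:
            letter, size, offset, fstype = m.groups()
            parts[letter] = (int(size), int(offset), fstype)
    return parts


def root_partition_letter(mountroot_spec: str) -> str:
    """'ufs:/dev/mirror/gm0s1a' -> 'a' (the last character of the device name)."""
    _fstype, _, device = mountroot_spec.partition(":")
    return device[-1]


def root_partition_missing(label_text: str, mountroot_spec: str) -> bool:
    """True if the label has no partition matching the root device's letter."""
    return root_partition_letter(mountroot_spec) not in partitions(label_text)


# HYPOTHETICAL label containing a root ('a') partition, for illustration only.
SAMPLE_LABEL = """\
# /dev/mirror/gm0s1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a:  1048576       16    4.2BSD     2048 16384     8
  b:  2097152  1048592      swap
  c: 39873330        0    unused        0     0         # "raw" part, don't edit
"""

if __name__ == "__main__":
    spec = "ufs:/dev/mirror/gm0s1a"
    print(sorted(partitions(SAMPLE_LABEL)))            # ['a', 'b', 'c']
    print(root_partition_missing(SAMPLE_LABEL, spec))  # False
```

On a real system the input would come from something like `bsdlabel mirror/gm0s1` (and `bsdlabel ad0s1` for the label on the old disk). bsdlabel(8) can also write a label saved to a file back with `bsdlabel -R`, which is one way to do the copy Pawel suggests, provided the target slice is large enough for the partitions.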