David Dyer-Bennet
2009-Feb-01 04:51 UTC
[zfs-discuss] Zpool corrupt, crashes system when imported -- one last try
As I've said, I've got this ZFS pool, with 450GB of my data in it, that crashes the system when I import it. I've now got a dozen or so log entries on this, having finally gotten it to happen in a controlled enough environment that I can get at the log files and transfer them somewhere I can access them. They all appear to be exactly the same thing.

This crash happens nearly instantly after I do a "zpool import -f zp1". Since the stack traceback suggests it's involved in a scrub, and indeed I do think it was scrubbing when this first started happening, I've tried doing "zpool import -f zp1; zpool scrub -s zp1" in hopes that it will get in in time to stop the hypothetical scrub.

I'm running OpenSolaris 2008.11.

I'd like to rescue this pool. If I can't, though, I've got to destroy it and restore from backup (which shouldn't make things much worse than they already are; my backups are in decent shape so far as I know). And I've spent far, far too much time on this (busy at work, other things at home, so this has dragged on interminably). And I'd love to provide useful information to anybody interested in finding and fixing whatever is wrong in the code that left me in this position, of course. Nobody has responded to my image of the stack traceback from yesterday, but I'm hoping that now that I've managed to get more information (including the stuff before the stack traceback), somebody may be able to do something with it. I do also seem to have a dump, if that's any use to anybody:

Jan 31 22:34:11 fsfs genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Jan 31 22:34:21 fsfs genunix: [ID 409368 kern.notice] ^M100% done: 197356 pages dumped, compression ratio 3.20,
Jan 31 22:34:21 fsfs genunix: [ID 851671 kern.notice] dump succeeded

Here's how the pool shows before I try to import it:

localddb at fsfs:~$ pfexec zpool import
  pool: zp1
    id: 4278074723887284013
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        zp1         ONLINE
          mirror    ONLINE
            c5t0d0  ONLINE
            c5t1d0  ONLINE
          mirror    ONLINE
            c6t0d0  ONLINE
            c6t1d0  ONLINE
localddb at fsfs:~$

Should that report say if a scrub was in progress?
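One way to give that "zpool scrub -s zp1" a better chance of winning the race against the panic is to queue the cancel before the import, so it fires the instant the pool will accept it. This is only a sketch of the idea, using nothing beyond the stock zpool commands; it still loses if the panic hits during the very first txg sync, before the pool is far enough along to take the cancel.

  # Sketch only: retry the scrub cancel in the background until it succeeds,
  # then do the import.  Pool name is from this thread; the approach itself
  # is an untested guess.
  ( while ! pfexec zpool scrub -s zp1 2>/dev/null; do :; done ) &
  pfexec zpool import -f zp1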
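The crash dump itself is probably the most useful artifact here. Assuming savecore has written it to the default /var/crash/<hostname> directory (both the path and the dump index below are guesses), a short mdb session will pull out the panic summary, the message buffer, and the panicking thread's stack, which is usually the first thing the ZFS developers ask for:

  # Paths are assumptions -- substitute whatever savecore actually reported.
  pfexec mdb /var/crash/fsfs/unix.0 /var/crash/fsfs/vmcore.0
  > ::status     # panic string and basic dump info
  > ::msgbuf     # kernel messages leading up to the panic
  > $C           # stack backtrace of the panicking thread, with arguments
  > $q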
And here's the crash that happens immediately when I do import it:

Jan 31 22:34:10 fsfs unix: [ID 836849 kern.notice]
Jan 31 22:34:10 fsfs ^Mpanic[cpu1]/thread=ffffff00045fdc80:
Jan 31 22:34:10 fsfs genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ffffff00045fcd60 addr=4e8 occurred in module "unix" due to a NULL pointer dereference
Jan 31 22:34:10 fsfs unix: [ID 100000 kern.notice]
Jan 31 22:34:10 fsfs unix: [ID 839527 kern.notice] sched:
Jan 31 22:34:10 fsfs unix: [ID 753105 kern.notice] #pf Page fault
Jan 31 22:34:10 fsfs unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x4e8
Jan 31 22:34:10 fsfs unix: [ID 243837 kern.notice] pid=0, pc=0xfffffffffb84e84b, sp=0xffffff00045fce58, eflags=0x10246
Jan 31 22:34:10 fsfs unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
Jan 31 22:34:10 fsfs unix: [ID 624947 kern.notice] cr2: 4e8
Jan 31 22:34:10 fsfs unix: [ID 625075 kern.notice] cr3: 3400000
Jan 31 22:34:10 fsfs unix: [ID 625715 kern.notice] cr8: c
Jan 31 22:34:10 fsfs unix: [ID 100000 kern.notice]
Jan 31 22:34:10 fsfs unix: [ID 592667 kern.notice] rdi: 4e8 rsi: a200 rdx: ffffff00045fdc80
Jan 31 22:34:10 fsfs unix: [ID 592667 kern.notice] rcx: 1 r8: ffffff01599c8540 r9: 0
Jan 31 22:34:10 fsfs unix: [ID 592667 kern.notice] rax: 0 rbx: 0 rbp: ffffff00045fced0
Jan 31 22:34:10 fsfs unix: [ID 592667 kern.notice] r10: 5b40 r11: 0 r12: ffffff0161e92040
Jan 31 22:34:10 fsfs unix: [ID 592667 kern.notice] r13: ffffff0176684800 r14: ffffff0161e92040 r15: 0
Jan 31 22:34:10 fsfs unix: [ID 592667 kern.notice] fsb: 0 gsb: ffffff0149f68500 ds: 4b
Jan 31 22:34:10 fsfs unix: [ID 592667 kern.notice] es: 4b fs: 0 gs: 1c3
Jan 31 22:34:10 fsfs unix: [ID 592667 kern.notice] trp: e err: 2 rip: fffffffffb84e84b
Jan 31 22:34:10 fsfs unix: [ID 592667 kern.notice] cs: 30 rfl: 10246 rsp: ffffff00045fce58
Jan 31 22:34:10 fsfs unix: [ID 266532 kern.notice] ss: 38
Jan 31 22:34:10 fsfs unix: [ID 100000 kern.notice]
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fcc40 unix:die+dd ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fcd50 unix:trap+1752 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fcd60 unix:_cmntrap+e9 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fced0 unix:mutex_enter+b ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fcfe0 zfs:scrub_visitbp+61b ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd0f0 zfs:scrub_visitbp+5b3 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd200 zfs:scrub_visitbp+223 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd310 zfs:scrub_visitbp+282 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd420 zfs:scrub_visitbp+223 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd530 zfs:scrub_visitbp+223 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd640 zfs:scrub_visitbp+223 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd750 zfs:scrub_visitbp+223 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd860 zfs:scrub_visitbp+438 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd8b0 zfs:scrub_visit_rootbp+4f ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fd910 zfs:scrub_visitds+7e ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fdac0 zfs:dsl_pool_scrub_sync+123 ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fdb30 zfs:dsl_pool_sync+18c ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fdbc0 zfs:spa_sync+2af ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fdc60 zfs:txg_sync_thread+1fc ()
Jan 31 22:34:10 fsfs genunix: [ID 655072 kern.notice] ffffff00045fdc70 unix:thread_start+8 ()
Jan 31 22:34:10 fsfs unix: [ID 100000 kern.notice]
Jan 31 22:34:10 fsfs genunix: [ID 672855 kern.notice] syncing file systems...
Jan 31 22:34:10 fsfs genunix: [ID 904073 kern.notice] done
Jan 31 22:34:11 fsfs genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Jan 31 22:34:21 fsfs genunix: [ID 409368 kern.notice] ^M100% done: 197356 pages dumped, compression ratio 3.20,
Jan 31 22:34:21 fsfs genunix: [ID 851671 kern.notice] dump succeeded
Jan 31 22:34:50 fsfs genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version snv_101b 64-bit
Jan 31 22:34:50 fsfs genunix: [ID 172908 kern.notice] Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Jan 31 22:34:50 fsfs Use is subject to license terms.
Jan 31 22:34:50 fsfs unix: [ID 126719 kern.info] features: 613f6fff<cpuid,tscp,cmp,cx16,sse3,nx,asysc,sse2,sse,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
Jan 31 22:34:50 fsfs unix: [ID 168242 kern.info] mem = 2095676K (0x7fe8f000)
Jan 31 22:34:50 fsfs unix: [ID 972737 kern.info] Skipping psm: xpv_psm
Jan 31 22:34:50 fsfs rootnex: [ID 466748 kern.info] root nexus = i86pc
Jan 31 22:34:50 fsfs iommulib: [ID 321598 kern.info] NOTICE: iommulib_nexus_register: rootnex-1: Succesfully registered NEXUS i86pc nexops=fffffffffbceadb0
Jan 31 22:34:50 fsfs rootnex: [ID 349649 kern.info] pseudo0 at root
Jan 31 22:34:50 fsfs genunix: [ID 936769 kern.info] pseudo0 is /pseudo
Jan 31 22:34:50 fsfs rootnex: [ID 349649 kern.info] scsi_vhci0 at root
Jan 31 22:34:50 fsfs genunix: [ID 936769 kern.info] scsi_vhci0 is /scsi_vhci
Jan 31 22:34:50 fsfs rootnex: [ID 349649 kern.info] isa0 at root

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
David Dyer-Bennet
2009-Feb-01 18:54 UTC
[zfs-discuss] Zpool corrupt, crashes system when imported -- one last try
Sorry for following up to myself.

It was suggested off-list that I try to see if the slightly older code in the Belenix liveCD would maybe work with my pool. Nope -- because I upgraded my pool back when I thought OpenSolaris 2008.11 was going to work, so it's too new a version for the code on that liveCD. I *did* try the 2008.11 liveCD, in addition to my installation, and they both crash when I import the pool.

And I'm not going to do anything drastic and permanent today after all. Hoping for insight from somebody!

On Sat, January 31, 2009 22:51, David Dyer-Bennet wrote:
> As I've said, I've got this ZFS pool, with 450GB of my data in it, that
> crashes the system when I import it.
> [... rest of original report and panic trace snipped; see previous message ...]
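On the version mismatch with the older liveCD: the on-disk pool version can be checked without importing anything, which makes it easy to tell in advance whether a given build's ZFS bits are new enough. "zpool upgrade -v" lists the versions the running build supports, and zdb can read the version straight out of a vdev label. The device path below is only a guess based on the pool config shown earlier.

  # Versions supported by the currently booted bits (e.g. on a liveCD)
  zpool upgrade -v

  # Read a vdev label directly; the 'version' field is the pool's on-disk
  # version.  Device/slice name is an assumption -- use one of the pool's disks.
  pfexec zdb -l /dev/dsk/c5t0d0s0 | grep version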
-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info