Brian Kolaci
2010-Jul-02 07:01 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
I've recently acquired some storage and have been trying to copy data from a remote data center to hold backup data. The copies had been going for weeks, with about 600GB transferred so far, and then I noticed the throughput on the router stopped. I then saw that a pool had disappeared.

# zpool status -x

  pool: pool4_green
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: sun.com/msg/ZFS-8000-72
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        pool4_green   FAULTED      0     0     1  corrupted data
          raidz2      ONLINE       0     0     6
            c10t0d0   ONLINE       0     0     0
            c10t1d0   ONLINE       0     0     0
            c10t2d0   ONLINE       0     0     0
            c10t3d0   ONLINE       0     0     0
            c10t4d0   ONLINE       0     0     0
            c10t5d0   ONLINE       0     0     0
            c10t6d0   ONLINE       0     0     0

I powered down the system and all of the attached disk storage, then powered everything back up fresh. I tried to clear the error:

# zpool clear pool4_green
internal error: Bad exchange descriptor
Abort (core dumped)

So then I took a look with zdb:

# zdb -vvv pool4_green
    version=15
    name='pool4_green'
    state=0
    txg=83
    pool_guid=3115817837859301858
    hostid=237914636
    hostname='galaxy'
    vdev_tree
        type='root'
        id=0
        guid=3115817837859301858
        children[0]
                type='raidz'
                id=0
                guid=10261633106033684483
                nparity=2
                metaslab_array=24
                metaslab_shift=36
                ashift=9
                asize=6997481881600
                is_log=0
                children[0]
                        type='disk'
                        id=0
                        guid=11313548069045029894
                        path='/dev/dsk/c10t0d0s0'
                        devid='id1,sd@n60026b9040e26100139d8514065a1d67/a'
                        phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@0,0:a'
                        whole_disk=1
                children[1]
                        type='disk'
                        id=1
                        guid=5547727760941401848
                        path='/dev/dsk/c10t1d0s0'
                        devid='id1,sd@n60026b9040e26100139d851d06ec511d/a'
                        phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@1,0:a'
                        whole_disk=1
                children[2]
                        type='disk'
                        id=2
                        guid=8407102896612298450
                        path='/dev/dsk/c10t2d0s0'
                        devid='id1,sd@n60026b9040e26100139d85260770c1bf/a'
                        phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@2,0:a'
                        whole_disk=1
                children[3]
                        type='disk'
                        id=3
                        guid=17509238716791782209
                        path='/dev/dsk/c10t3d0s0'
                        devid='id1,sd@n60026b9040e26100139d852f07fea314/a'
                        phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@3,0:a'
                        whole_disk=1
                children[4]
                        type='disk'
                        id=4
                        guid=18419120996062075464
                        path='/dev/dsk/c10t4d0s0'
                        devid='id1,sd@n60026b9040e26100139d8537086c271f/a'
                        phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@4,0:a'
                        whole_disk=1
                children[5]
                        type='disk'
                        id=5
                        guid=8308368067368943006
                        path='/dev/dsk/c10t5d0s0'
                        devid='id1,sd@n60026b9040e26100139d85440934f640/a'
                        phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@5,0:a'
                        whole_disk=1
                children[6]
                        type='disk'
                        id=6
                        guid=14740659507803921957
                        path='/dev/dsk/c10t6d0s0'
                        devid='id1,sd@n60026b9040e26100139d854a09957d56/a'
                        phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@6,0:a'
                        whole_disk=1
zdb: can't open pool4_green: Bad exchange descriptor

So what is this "Bad exchange descriptor", and do I have a prayer of getting my data back?

Thanks,

Brian
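When a pool faults with corrupted metadata while every member disk still shows ONLINE, one low-risk check is to dump the ZFS labels of an individual member with zdb -l; it reads the label copies directly off the disk without trying to open the pool. A sketch against the first device path listed in the config above (repeat for the other members as needed; output will vary):

# zdb -l /dev/dsk/c10t0d0s0
# expected: four sections, LABEL 0 through LABEL 3, each carrying the same
# pool_guid and vdev configuration; a missing or unreadable label helps
# narrow down where the corruption lives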
Brian Kolaci
2010-Jul-06 14:30 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
Well, I see no takers or even a hint...

I've been playing with zdb to try to examine the pool, but I get:

# zdb -b pool4_green
zdb: can't open pool4_green: Bad exchange descriptor

# zdb -d pool4_green
zdb: can't open pool4_green: Bad exchange descriptor

So I'm not sure how to debug this with zdb. Is there something better, or something else I should be looking at? The disks are all there, all online. How can I at least roll back to the last consistent data on there? Or is all hope lost, and I've lost over 600GB of data?

The worst part is that there are no errors in the logs; the pool just "disappeared" without a trace. The only log entries are from subsequent reboots, where it says a ZFS pool failed to open.

This does not give me a warm & fuzzy feeling about using ZFS, which I've relied on heavily for the past 5 years.

Any advice would be much appreciated...

On 7/2/2010 3:01 AM, Brian Kolaci wrote:
> I've recently acquired some storage and have been trying to copy data from a remote data center to hold backup data. The copies had been going for weeks, with about 600GB transferred so far, and then I noticed the throughput on the router stopped. I then saw that a pool had disappeared.
> [...]
> # zpool clear pool4_green
> internal error: Bad exchange descriptor
> Abort (core dumped)
> [...]
> zdb: can't open pool4_green: Bad exchange descriptor
>
> So what is this "Bad exchange descriptor", and do I have a prayer of getting my data back?
>
> Thanks,
>
> Brian
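One more place worth checking when syslog is clean: the FMA error log. Disk retries, transport resets and ZFS checksum ereports are recorded there even when nothing ever reaches /var/adm/messages. A read-only sketch, safe to run at any time:

# fmdump -eV | tail -60    # recent error reports (ereports) in full detail
# fmadm faulty             # anything FMA has actually diagnosed as faulted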
Victor Latushkin
2010-Jul-06 14:37 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
On Jul 6, 2010, at 6:30 PM, Brian Kolaci wrote:

> Well, I see no takers or even a hint...
>
> I've been playing with zdb to try to examine the pool, but I get:
>
> # zdb -b pool4_green
> zdb: can't open pool4_green: Bad exchange descriptor
>
> # zdb -d pool4_green
> zdb: can't open pool4_green: Bad exchange descriptor
> [...]
> The disks are all there, all online. How can I at least roll back to the last consistent data on there? Or is all hope lost, and I've lost over 600GB of data?
> [...]
> Any advice would be much appreciated...

You can download the build 134 LiveCD, boot off it, and try 'zpool import -nfF pool4_green' for a start.

regards
victor
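For reference, the recovery Victor is suggesting relies on the pool-recovery support that appeared around build 128: -F rewinds the pool to the last transaction group that can be opened, discarding the last few seconds of writes, while -n turns it into a dry run that only reports what a recovery would do. A sketch of the sequence (pool name taken from the thread above; the LiveCD is needed because the zpool on the original Solaris 10 install has no -F option):

# zpool import -nfF pool4_green   # dry run: reports whether a rewind would succeed and roughly what would be lost
# zpool import -fF pool4_green    # perform the recovery import, rewinding to the last good txg
# zpool status -v pool4_green     # confirm the pool is back online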
Richard Elling
2010-Jul-06 23:12 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
On Jul 6, 2010, at 7:30 AM, Brian Kolaci wrote:

> Well, I see no takers or even a hint...
>
> I've been playing with zdb to try to examine the pool, but I get:
>
> # zdb -b pool4_green
> zdb: can't open pool4_green: Bad exchange descriptor

For the archives, EBADE "Bad exchange descriptor" was repurposed as ECKSUM in zio.h. This agrees with the zpool status below. I recommend Victor's advice.
 -- richard

> # zdb -d pool4_green
> zdb: can't open pool4_green: Bad exchange descriptor
> [...]
> On 7/2/2010 3:01 AM, Brian Kolaci wrote:
>> [...]
>> # zpool status -x
>>
>>   pool: pool4_green
>>  state: FAULTED
>> status: The pool metadata is corrupted and the pool cannot be opened.
>> [...]
>>         NAME          STATE     READ WRITE CKSUM
>>         pool4_green   FAULTED      0     0     1  corrupted data
>>           raidz2      ONLINE       0     0     6
>> [...]

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
nexenta-rotterdam.eventbrite.com
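For readers finding this thread in the archives: because ECKSUM is defined on top of EBADE in the OpenSolaris ZFS headers, any tool that hands the error straight to strerror() prints the historical System V text for EBADE. A minimal illustration, assuming a platform that still defines EBADE (the #define in the comment is paraphrased from sys/zio.h of that era):

/* errno_demo.c -- why ECKSUM surfaces as "Bad exchange descriptor".
 * The OpenSolaris ZFS sources reuse a little-used SysV errno for
 * checksum failures, roughly:
 *
 *     #define ECKSUM  EBADE
 *
 * so strerror(ECKSUM) yields the old EBADE message.
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>

int
main(void)
{
#ifdef EBADE
	/* On Solaris this prints "Bad exchange descriptor"; other
	 * platforms may word it differently (e.g. "Invalid exchange"). */
	(void) printf("EBADE = %d: %s\n", EBADE, strerror(EBADE));
#else
	(void) puts("EBADE is not defined on this platform");
#endif
	return (0);
}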
Brian Kolaci
2010-Jul-07 23:24 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
On 7/6/2010 10:37 AM, Victor Latushkin wrote:

> On Jul 6, 2010, at 6:30 PM, Brian Kolaci wrote:
>> [...]
>> Any advice would be much appreciated...
>
> You can download the build 134 LiveCD, boot off it, and try 'zpool import -nfF pool4_green' for a start.
>
> regards
> victor

Thank you Victor! That did it. It recovered the pool, and I lost only about 30 seconds of transactions. This helps a lot. There was actually 2.5TB of data, not just 600GB.

I can't wait until the -F recovery option makes it into Solaris 10 (or will it?).
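After a rewind import like this, a scrub is the usual way to confirm that everything still in the pool checksums cleanly. A sketch (on ~2.5TB it can take a while):

# zpool scrub pool4_green
# zpool status -v pool4_green    # check again once the scrub completes; it should report no errors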