Brian Kolaci
2010-Jul-02 07:01 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
I've recently acquired some storage and have been trying to copy data from a remote
data center to hold backup data. The copies had been going for weeks, with about 600GB
transferred so far, when I noticed that the throughput on the router had stopped and
saw that a pool had disappeared.
# zpool status -x
  pool: pool4_green
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
 scrub: none requested
config:

        NAME         STATE    READ WRITE CKSUM
        pool4_green  FAULTED     0     0     1  corrupted data
          raidz2     ONLINE      0     0     6
            c10t0d0  ONLINE      0     0     0
            c10t1d0  ONLINE      0     0     0
            c10t2d0  ONLINE      0     0     0
            c10t3d0  ONLINE      0     0     0
            c10t4d0  ONLINE      0     0     0
            c10t5d0  ONLINE      0     0     0
            c10t6d0  ONLINE      0     0     0
I powered down the system and all the disk systems and powered up fresh.
I tried to clear the error:
# zpool clear pool4_green
internal error: Bad exchange descriptor
Abort (core dumped)
So then I took a look with zdb:
# zdb -vvv pool4_green
    version=15
    name='pool4_green'
    state=0
    txg=83
    pool_guid=3115817837859301858
    hostid=237914636
    hostname='galaxy'
    vdev_tree
        type='root'
        id=0
        guid=3115817837859301858
        children[0]
            type='raidz'
            id=0
            guid=10261633106033684483
            nparity=2
            metaslab_array=24
            metaslab_shift=36
            ashift=9
            asize=6997481881600
            is_log=0
            children[0]
                type='disk'
                id=0
                guid=11313548069045029894
                path='/dev/dsk/c10t0d0s0'
                devid='id1,sd@n60026b9040e26100139d8514065a1d67/a'
                phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@0,0:a'
                whole_disk=1
            children[1]
                type='disk'
                id=1
                guid=5547727760941401848
                path='/dev/dsk/c10t1d0s0'
                devid='id1,sd@n60026b9040e26100139d851d06ec511d/a'
                phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@1,0:a'
                whole_disk=1
            children[2]
                type='disk'
                id=2
                guid=8407102896612298450
                path='/dev/dsk/c10t2d0s0'
                devid='id1,sd@n60026b9040e26100139d85260770c1bf/a'
                phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@2,0:a'
                whole_disk=1
            children[3]
                type='disk'
                id=3
                guid=17509238716791782209
                path='/dev/dsk/c10t3d0s0'
                devid='id1,sd@n60026b9040e26100139d852f07fea314/a'
                phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@3,0:a'
                whole_disk=1
            children[4]
                type='disk'
                id=4
                guid=18419120996062075464
                path='/dev/dsk/c10t4d0s0'
                devid='id1,sd@n60026b9040e26100139d8537086c271f/a'
                phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@4,0:a'
                whole_disk=1
            children[5]
                type='disk'
                id=5
                guid=8308368067368943006
                path='/dev/dsk/c10t5d0s0'
                devid='id1,sd@n60026b9040e26100139d85440934f640/a'
                phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@5,0:a'
                whole_disk=1
            children[6]
                type='disk'
                id=6
                guid=14740659507803921957
                path='/dev/dsk/c10t6d0s0'
                devid='id1,sd@n60026b9040e26100139d854a09957d56/a'
                phys_path='/pci@0,0/pci8086,3597@4/pci1028,1f0a@0/sd@6,0:a'
                whole_disk=1
zdb: can't open pool4_green: Bad exchange descriptor
So what is this "Bad exchange descriptor", and do I have a prayer of getting my data back?
Thanks,
Brian
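
One diagnostic not shown above, for anyone hitting the same symptoms: zdb -l reads the on-disk ZFS labels from a device directly and works even when the pool itself refuses to open. A minimal sketch, assuming the c10tNd0s0 device names from the vdev_tree above:

    # Dump the on-disk labels of each raidz2 member; the pool does not need to import for this.
    for d in c10t0d0 c10t1d0 c10t2d0 c10t3d0 c10t4d0 c10t5d0 c10t6d0
    do
            echo "=== $d ==="
            zdb -l /dev/dsk/${d}s0
    done

If all seven devices show consistent labels with the same pool_guid and txg, the problem is more likely in the pool-wide metadata than in any single disk.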
Brian Kolaci
2010-Jul-06 14:30 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
Well, I see no takers or even a hint...

I've been playing with zdb to try to examine the pool, but I get:

# zdb -b pool4_green
zdb: can't open pool4_green: Bad exchange descriptor

# zdb -d pool4_green
zdb: can't open pool4_green: Bad exchange descriptor

So I'm not sure how to debug using zdb. Is there something better, or something else I should be looking at?
The disks are all there, all online. How can I at least roll back to the last consistent bit of data on there?
Or is all hope lost, along with over 600GB of data?

The worst part is that there are no errors in the logs and the pool just "disappeared" without a trace.
The only log entries are from subsequent reboots, where it says a ZFS pool failed to open.

It does not give me a warm & fuzzy feeling about using ZFS, which I've relied on heavily for the past 5 years.

Any advice would be well appreciated...

On 7/2/2010 3:01 AM, Brian Kolaci wrote:
> I've recently acquired some storage and have been trying to copy data from a remote data center to hold backup data. [...]
> So what is this "Bad exchange descriptor", and do I have a prayer of getting my data back?
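
Worth checking alongside syslog, since the complaint here is that nothing was logged: ZFS checksum and I/O faults are reported to FMA, so the fault-management telemetry may hold more detail than /var/adm/messages. A short sketch with the stock Solaris fault-management tools:

    # Any faults FMA has already diagnosed
    fmadm faulty

    # Summary of raw error reports; ZFS ereports show up as ereport.fs.zfs.*
    fmdump -e

    # Full detail once an interesting timestamp is found
    fmdump -eV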
Victor Latushkin
2010-Jul-06 14:37 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
On Jul 6, 2010, at 6:30 PM, Brian Kolaci wrote:
> Well, I see no takers or even a hint...
>
> I've been playing with zdb to try to examine the pool, but I get:
>
> # zdb -b pool4_green
> zdb: can't open pool4_green: Bad exchange descriptor
>
> # zdb -d pool4_green
> zdb: can't open pool4_green: Bad exchange descriptor
>
> So I'm not sure how to debug using zdb. Is there something better, or something else I should be looking at?
> The disks are all there, all online. How can I at least roll back to the last consistent bit of data on there?
> Or is all hope lost, along with over 600GB of data?
> [...]
> Any advice would be well appreciated...

You can download the build 134 LiveCD, boot off it, and try 'zpool import -nfF pool4_green' for a start.

regards
victor
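
For the archives, the recovery Victor is pointing at would look roughly like this once booted from the build 134 LiveCD; with -F the import may discard the last few transactions to reach a consistent state, and adding -n turns that into a dry run that only reports what would happen. A sketch, not a verified transcript from this pool:

    # Dry run: report whether rolling back a few transactions would let the pool import
    zpool import -nfF pool4_green

    # If the dry run looks acceptable, do the recovery import for real
    zpool import -fF pool4_green

    # Then confirm the pool state
    zpool status -v pool4_green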
Richard Elling
2010-Jul-06 23:12 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
On Jul 6, 2010, at 7:30 AM, Brian Kolaci wrote:
> Well, I see no takers or even a hint...
>
> I've been playing with zdb to try to examine the pool, but I get:
>
> # zdb -b pool4_green
> zdb: can't open pool4_green: Bad exchange descriptor

For the archives, EBADE "Bad exchange descriptor" was repurposed as ECKSUM in zio.h. This agrees with the zpool status below. I recommend Victor's advice.
 -- richard

> On 7/2/2010 3:01 AM, Brian Kolaci wrote:
>> # zpool status -x
>>
>> pool: pool4_green
>> state: FAULTED
>> status: The pool metadata is corrupted and the pool cannot be opened.
>> [...]
>> NAME         STATE    READ WRITE CKSUM
>> pool4_green  FAULTED     0     0     1  corrupted data
>>   raidz2     ONLINE      0     0     6
>> [...]

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
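
For anyone wanting to see the mapping Richard describes for themselves: the message text is simply strerror() for the stock EBADE errno, which the ZFS code reuses for checksum failures. A hedged way to confirm, assuming access to an OpenSolaris source tree (the source path below is an assumption):

    # The errno behind the message text
    grep EBADE /usr/include/sys/errno.h

    # The repurposing in the ZFS headers
    grep ECKSUM usr/src/uts/common/fs/zfs/sys/zio.h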
Brian Kolaci
2010-Jul-07 23:24 UTC
[zfs-discuss] pool wide corruption, "Bad exchange descriptor"
On 7/6/2010 10:37 AM, Victor Latushkin wrote:
> You can download the build 134 LiveCD, boot off it, and try 'zpool import -nfF pool4_green' for a start.
>
> regards
> victor

Thank you Victor! That did it. It recovered the pool and I lost only 30 seconds of transactions.
This helps a lot. There was actually 2.5TB of data, not just 600GB.

I can't wait until the -F (recovery option) makes it back into Solaris 10 (or will it?).
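
A reasonable follow-up after a forced recovery import, though not mentioned in the thread, is to scrub the recovered pool so that any blocks damaged by whatever caused the corruption are found and, where possible, repaired from the raidz2 parity:

    # Walk every allocated block and verify it against its checksum
    zpool scrub pool4_green

    # Watch progress and any errors the scrub turns up
    zpool status -v pool4_green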