Hi List,
First of all: S10u4 120011-14
So I have a weird situation. Earlier this week, I finally mirrored up
two iSCSI based pools. I had been wanting to do this for some time,
because the availability of the data in these pools is important. One
pool mirrored just fine, but the other pool is another story.
First lesson (I think) is you should scrub your pools, at least those
backed by a SAN, before mirroring them. The problem pool was scrubbed
about two weeks before I mirrored it, and it was clean. I assumed,
wrongly, that no checksum errors had crept in during the time that
elapsed. Well, guess again. When I mirrored this guy, the source side
had two checksum errors. Interestingly, the target inherited these
errors, and so now both sides of the mirror showed two checksum errors
in the counters. I don't know if this was real, or if the zpool attach
operation just incremented the counters on the second half of the
mirror.
My next mistake was to assume the counters were in error on the second
mirror, and so I zeroed out the counters with zpool clear. OK, so now I
scrub the pool, and no checksum errors were found on either side of the
mirror. Huh?!? What about those two checksum errors on the first
mirror? OK, so I run zdb on the pool, and it finds scads of errors:
Traversing all blocks to verify checksums and verify nothing leaked ...
zdb_blkptr_cb: Got error 50 reading <33, 727252, 0, 4a> -- skipping--
...
and then tons of:
Error counts:
errno count
50 123
leaked space: vdev 0, offset 0x4deaed800, size 2048
...
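To see whether the zdb read errors cluster on a few objects or are spread
across the pool, the "Got error" lines can be tallied. The following is a
minimal sketch, not a supported tool. It assumes the four values inside
the angle brackets are <objset, object, level, blkid> with blkid printed
in hex, which is an interpretation and not something stated in the output
itself. The function names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   zdb_blkptr_cb: Got error 50 reading <33, 727252, 0, 4a> -- skipping--
# ASSUMPTION: the tuple is <objset, object, level, blkid>, blkid in hex.
LINE_RE = re.compile(
    r"Got error (\d+) reading <(\d+), (\d+), (-?\d+), ([0-9a-fA-F]+)>"
)


def parse_zdb_errors(text):
    """Return one dict per 'Got error' line found in zdb output."""
    records = []
    for match in LINE_RE.finditer(text):
        err, objset, obj, level, blkid = match.groups()
        records.append({
            "errno": int(err),
            "objset": int(objset),
            "object": int(obj),
            "level": int(level),
            "blkid": int(blkid, 16),  # parsed as hexadecimal
        })
    return records


def count_by_object(records):
    """Count errors per (objset, object) pair."""
    return Counter((r["objset"], r["object"]) for r in records)


if __name__ == "__main__":
    sample = (
        "Traversing all blocks to verify checksums and verify nothing leaked ...\n"
        "zdb_blkptr_cb: Got error 50 reading <33, 727252, 0, 4a> -- skipping--\n"
    )
    recs = parse_zdb_errors(sample)
    for key, n in count_by_object(recs).most_common():
        print(key, n)
```

Feeding it the saved zdb output from before and after breaking the mirror
would show whether the same objects are reported both times.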
OK, this is odd, so I scrub the pool again, and this time it found 4
checksum errors on the initial mirror, but none on the other mirror.
That makes some sense (though I don't know what changed), so I break
the mirror, taking off the original side that has the checksum errors.
I then scrub the pool, no errors found. That's good, but just to be
sure, I run zdb on it, and it finds tons of the same errors as it found
on the original side of the mirror. Argh!
In the meantime, I ran 4 passes of format -> analyze -> compare on the
initial half of the mirror that had the checksum errors, and it's
totally clean hardware-wise.
So my questions are these:
1) Does zdb leaked space mean trouble with the pool?
2) Is it possible that the errors got injected into the new half of the
mirror when I attached it? For now, I'm going to assume that the new
half of the mirror is OK, hardware-wise.
3) I'm running a scrub and zdb on the other pool that lives on these
SAN boxes, because I want to see if they come up with the same
problems. If not, what would be going on with this crazy pool?
4) Can I recover from this without copying the whole pool to new
storage? If not, it will be painful for us. We will have to reboot 350
servers and workstations on stale file handles, interrupting hundreds
of production processes. My user base is losing faith in my team.
Oh sage ones, please advise. Thanks in advance.
Jon