Luke Scharf
2008-Apr-08 21:35 UTC
[zfs-discuss] Device naming weirdness -- possible bug report?
*Platform:*
* OpenSolaris snv79 on an older beige-box Intel x86
* Apple XRaid disk box, with 7 JBOD disks
* LSI FC controller - http://www.lsi.com/storage_home/products_home/host_bus_adapters/fibre_channel_hbas/lsi7404eplc/index.html?remote=1&locale=EN

*Description:*
When a drive is yanked, this happy pool:

    datapool                     ONLINE       0     0     0
      raidz1                     ONLINE       0     0     0
        c9t60003930000214EEd0    ONLINE       0     0     0
        c9t60003930000214EEd1    ONLINE       0     0     0
        c9t60003930000214EEd2    ONLINE       0     0     0
        c9t60003930000214EEd3    ONLINE       0     0     0
        c9t60003930000214EEd4    ONLINE       0     0     0
        c9t60003930000214EEd5    ONLINE       0     0     0
        c9t60003930000214EEd6    ONLINE       0     0     0

Turns into this unhappy pool that cannot reflect reality:

    datapool                     DEGRADED     0     0     0
      raidz1                     DEGRADED     0     0     0
        c9t60003930000214EEd0    ONLINE       0     0     0
        c9t60003930000214EEd1    ONLINE       0     0     0
        c9t60003930000214EEd2    ONLINE       0     0     0
        c9t60003930000214EEd3    ONLINE       0     0     0
        c9t60003930000214EEd4    ONLINE       0     0     0
        c9t60003930000214EEd6    FAULTED      0     0     0  corrupted data
        c9t60003930000214EEd6    ONLINE       0     0     0

Note that c9t60003930000214EEd6, impossibly, appears _*TWICE*_ in the list!

After replacing the disk with a mostly-blank disk (with some leftover zfs headers on it from another experiment), I'm unable to offline or replace c9t60003930000214EEd5, or generally do anything that would bring the array out of the degraded state.

If I export/import the pool, it looks like this:

    NAME                         STATE     READ WRITE CKSUM
    datapool                     DEGRADED     0     0     0
      raidz1                     DEGRADED     0     0     0
        c9t60003930000214EEd0    ONLINE       0     0     0
        c9t60003930000214EEd1    ONLINE       0     0     0
        c9t60003930000214EEd2    ONLINE       0     0     0
        c9t60003930000214EEd3    ONLINE       0     0     0
        c9t60003930000214EEd4    ONLINE       0     0     0
        6898074116173351320      FAULTED      0     0     0  was /dev/dsk/c9t60003930000214EEd6s0
        c9t60003930000214EEd6    ONLINE       0     0     0

    errors: No known data errors

*Some thoughts:*
* Has anyone else seen this?
* Having a device in the raidz list twice is clearly a problem!
* Being able to change the device list by exporting/importing (without plugging/unplugging any hardware) is clearly a problem, too!
* Might the LSI driver or the XRaid re-order the d[0-9] devices when one of them goes away?
* We're thinking of various other ways to expose this problem: a newer version of OpenSolaris (b85, probably), and blanking drives-used-in-other-experiments more aggressively.

Thanks,
-Luke
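
For anyone who wants to check their own pools for the same symptom, here is a rough sketch (not part of the original report) that scans status text in the layout shown above and flags any name listed more than once. The function name `duplicate_devices`, the `STATES` set, and the `SAMPLE` text are illustrative assumptions; the parsing assumes each device row is "NAME STATE READ WRITE CKSUM", optionally followed by a note, and has not been tested against every zpool status layout.

```python
from collections import Counter

# Vdev states that can appear in the STATE column of a status listing.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

def duplicate_devices(status_text):
    """Return the sorted names that occur more than once in status-style text."""
    names = []
    for line in status_text.splitlines():
        fields = line.split()
        # Keep only rows shaped like: NAME STATE READ WRITE CKSUM [note...]
        # This skips the column header and lines such as "errors: ...".
        if len(fields) >= 5 and fields[1] in STATES:
            names.append(fields[0])
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)

# The degraded listing from the report, used here as sample input.
SAMPLE = """\
datapool DEGRADED 0 0 0
  raidz1 DEGRADED 0 0 0
    c9t60003930000214EEd0 ONLINE 0 0 0
    c9t60003930000214EEd1 ONLINE 0 0 0
    c9t60003930000214EEd2 ONLINE 0 0 0
    c9t60003930000214EEd3 ONLINE 0 0 0
    c9t60003930000214EEd4 ONLINE 0 0 0
    c9t60003930000214EEd6 FAULTED 0 0 0 corrupted data
    c9t60003930000214EEd6 ONLINE 0 0 0
"""

if __name__ == "__main__":
    for name in duplicate_devices(SAMPLE):
        print("listed more than once:", name)
```

Run against the degraded listing, this reports c9t60003930000214EEd6. It only detects the duplicate name; it says nothing about why the name is duplicated.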