Jason
2013-Jan-10 07:51 UTC
[zfs-discuss] help zfs pool with duplicated and missing entry of hdd
Hi,

The ZFS pool on one of my servers has faulted, and it shows the following:

        NAME        STATE     READ WRITE CKSUM
        backup      UNAVAIL      0     0     0  insufficient replicas
          raidz2-0  UNAVAIL      0     0     0  insufficient replicas
            c4t0d0  ONLINE       0     0     0
            c4t0d1  ONLINE       0     0     0
            c4t0d0  FAULTED      0     0     0  corrupted data
            c4t0d3  FAULTED      0     0     0  too many errors
            c4t0d4  FAULTED      0     0     0  too many errors
            ...(omit the rest)

My question is why c4t0d0 appears twice and c4t0d2 is missing.

I have checked the controller card and the hard disks, and they are all working fine.

Please help: how should I troubleshoot this, what is the main cause, and how can I recover the pool?

Thank you.
Jim Klimov
2013-Jan-10 12:25 UTC
[zfs-discuss] help zfs pool with duplicated and missing entry of hdd
On 2013-01-10 08:51, Jason wrote:
> Hi,
>
> One of my server's zfs faulted and it shows following:
> NAME        STATE     READ WRITE CKSUM
> backup      UNAVAIL      0     0     0  insufficient replicas
>   raidz2-0  UNAVAIL      0     0     0  insufficient replicas
>     c4t0d0  ONLINE       0     0     0
>     c4t0d1  ONLINE       0     0     0
>     c4t0d0  FAULTED      0     0     0  corrupted data
>     c4t0d3  FAULTED      0     0     0  too many errors
>     c4t0d4  FAULTED      0     0     0  too many errors
>     ...(omit the rest).
>
> My question is why c4t0d0 appeared twice, and c4t0d2 is missing.
>
> Have check the controller card and hard disk, they are all working fine.

This renaming does look like an error in detecting (and subsequently naming) the disks. For example, if a connector came loose and one of the disks is not seen by the system, the numbering can shift in this manner. It is indeed strange, however, that only "d2" got shifted or went missing, and not all the numbers after it.

So, did you verify that the controller sees all the disks in the "format" command (and perhaps, after a cold reboot, in the BIOS)? Just in case, try unplugging and replugging all cables (power, data) in case their pins have oxidized over time.

HTH,
//Jim
Michael Hase
2013-Jan-10 14:03 UTC
[zfs-discuss] help zfs pool with duplicated and missing entry of hdd
On Thu, 10 Jan 2013, Jim Klimov wrote:
> On 2013-01-10 08:51, Jason wrote:
>> Hi,
>>
>> One of my server's zfs faulted and it shows following:
>> NAME        STATE     READ WRITE CKSUM
>> backup      UNAVAIL      0     0     0  insufficient replicas
>>   raidz2-0  UNAVAIL      0     0     0  insufficient replicas
>>     c4t0d0  ONLINE       0     0     0
>>     c4t0d1  ONLINE       0     0     0
>>     c4t0d0  FAULTED      0     0     0  corrupted data
>>     c4t0d3  FAULTED      0     0     0  too many errors
>>     c4t0d4  FAULTED      0     0     0  too many errors
>>     ...(omit the rest).
>>
>> My question is why c4t0d0 appeared twice, and c4t0d2 is missing.
>>
>> Have check the controller card and hard disk, they are all working fine.
>
> This renaming does seem like an error in detecting (and further naming)
> of the disks - i.e. if a connector got loose, and one of the disks is
> not seen by the system, the numbering can shift in such manner. It is
> indeed strange however that only "d2" got shifted or missing and not
> all those numbers after it.
>
> So, you did verify that the controller sees all the disks in "format"
> command (and perhaps after a cold reboot - in BIOS)? Just in case, try
> to unplug and replug all cables (power, data) in case their pins got
> oxidized over time.

Usually the disk numbering in any Solaris-based OS stays the same if one disk is offline or missing; it's fixed to the controller port, the SCSI target, or the WWN. IMHO that is a huge advantage of the c0t0d0 pattern over the Linux or FreeBSD numbering. I once had an old Sun 5200 hooked up to a Linux box, and when one of the 22 disks failed, every disk after the bad one shifted. What a mess.

To me the c4t0d0, c4t0d1, ... numbering looks like either a hardware RAID controller not in JBOD mode or even an external SAN. JBODs normally show up as LUN 0 (d0) with different target numbers (t1, t2, ...). Maybe something is wrong with the LUN numbering on your box?

-- Michael
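
For anyone who wants to check a long vdev list for this kind of problem, here is a minimal Python sketch, not part of the original thread. It splits cXtYdZ names into controller, target, and LUN numbers, then reports names that are listed more than once and LUN numbers that are skipped on a given controller/target pair. The helper names parse_device and check_vdev are made up for illustration, and the gap check assumes the LUNs behind one target are meant to be numbered contiguously, which may not hold on every box.

```python
import re
from collections import Counter

# Solaris-style device name: c<controller>t<target>d<lun>, optional s<slice>.
DEVICE_RE = re.compile(r"^c(\d+)t(\d+)d(\d+)(?:s\d+)?$")


def parse_device(name):
    """Split a name such as 'c4t0d2' into (controller, target, lun)."""
    match = DEVICE_RE.match(name)
    if match is None:
        raise ValueError(f"not a cXtYdZ device name: {name!r}")
    controller, target, lun = (int(group) for group in match.groups())
    return controller, target, lun


def check_vdev(names):
    """Return (duplicates, missing) for a list of vdev member names.

    duplicates: names that are listed more than once.
    missing: skipped LUN numbers per (controller, target), as device names.
    Assumption: LUNs behind one target are expected to be contiguous.
    """
    counts = Counter(names)
    duplicates = sorted(name for name, count in counts.items() if count > 1)

    # Collect the LUN numbers seen behind each controller/target pair.
    luns_by_port = {}
    for name in counts:
        controller, target, lun = parse_device(name)
        luns_by_port.setdefault((controller, target), set()).add(lun)

    # Any number between the lowest and highest LUN that is absent is a gap.
    missing = []
    for (controller, target), luns in sorted(luns_by_port.items()):
        for lun in range(min(luns), max(luns) + 1):
            if lun not in luns:
                missing.append(f"c{controller}t{target}d{lun}")
    return duplicates, missing


if __name__ == "__main__":
    # Member names as shown in the zpool status output from the first post.
    members = ["c4t0d0", "c4t0d1", "c4t0d0", "c4t0d3", "c4t0d4"]
    duplicates, missing = check_vdev(members)
    print("duplicates:", duplicates)
    print("missing:", missing)
```

Run against the five names from the first post, this reports c4t0d0 as a duplicate and c4t0d2 as a gap. It only inspects the names, so it says nothing about why the numbering is off. That still has to be checked against what the controller and "format" actually report.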