David Stewart
2009-Sep-29 19:20 UTC
[zfs-discuss] [ZFS-discuss] RAIDZ drive "removed" status
Having casually used IRIX in the past and used BeOS, Windows, and MacOS as primary OSes, last week I set up a RAIDZ NAS with four Western Digital 1.5TB drives and copied over data from my WinXP box. All of the hardware is fresh out of the box so I did not expect any hardware problems, but when I ran zpool status after a few days of uptime and copying 2.4TB of data to the system I received the following:

david@opensolarisnas:~$ zpool status mediapool
  pool: mediapool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mediapool   DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c8t0d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c8t3d0  FAULTED      0     0     0  too many errors

errors: No known data errors
david@opensolarisnas:~$

I read the Solaris documentation and it seemed to indicate that I needed to run zpool clear.

david@opensolarisnas:~$ zpool clear mediapool

And then the fun began. The system froze and rebooted, and I was stuck in a constant reboot cycle: it would get to GRUB, I would select "opensolaris-2", and the boot process would crash. Removing the SATA card that the RAIDZ disks were attached to would result in a successful boot. I reinserted the card, went through a few unsuccessful reboots, and magically it booted all the way for me to log in. I then received the following:

media@opensolarisnas:~$ zpool status -v mediapool
  pool: mediapool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub in progress for 0h2m, 0.29% done, 16h12m to go
config:

        NAME        STATE     READ WRITE CKSUM
        mediapool   DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c8t0d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c8t3d0  UNAVAIL      7     0     0  experienced I/O failures

errors: No known data errors
media@opensolarisnas:~$

I shut the machine down, unplugged the power supply, removed and reseated the SATA card, removed and reseated each of the SATA cables individually, and removed and reseated each of the SATA power cables. Rebooted:

david@opensolarisnas:~# zpool status -x mediapool
  pool: mediapool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h20m, 2.68% done, 12h29m to go
config:

        NAME        STATE     READ WRITE CKSUM
        mediapool   DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c8t0d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c8t3d0  REMOVED      0     0     0

errors: No known data errors
david@opensolarisnas:~#

The resilvering completed, everything seemed fine, and I shut the machine down. When I rebooted later I went through the same boot-and-crash cycle that never got me to the login screen, until it finally did get me to that screen for unknown reasons. The machine is resilvering currently with the zpool status the same as above.

What happened, why did it happen, and how can I stop it from happening again? Does OpenSolaris believe that c8t3d0 is not connected to the SATA card? The SATA card BIOS sees all four drives. What is the best way for me to figure out which drive is c8t3d0? Some operating systems will tell you which drive is which by telling you the serial number of the drive. Does OpenSolaris do this? If so, how?

I looked through all of the Solaris/OpenSolaris documentation re: ZFS and RAIDZ for a mention of a "removed" status for a drive in a RAIDZ configuration, but could not find mention of it outside of mirrors having this error. Page 231 of the OS Bible mentions reattaching a drive in the "removed" status from a mirror. Does this mean physically reattaching the drive (unplugging it and plugging it back in), or does it mean somehow reattaching it in software? If I run "zpool offline -t mediapool c8t3d0", reboot, then "zpool replace mediapool c8t3d0", then "zpool online mediapool c8t3d0", will this solve all my issues?

There is another issue and I don't know if it is related or not. If it isn't related, I will start another thread. The size of the RAIDZ1 available space before I put anything on it was 4TB. I put ~2.4TB of data on it. This is the size of the data on the WinXP NTFS box, and it is what Nautilus and Disk Usage Analyzer report. And yet "zfs list" reports I only have 432GB free. Disk Usage Analyzer reports that the filesystem capacity of mediapool is 2905.9GB. The 2905GB would add up correctly as 2.4TB + 432GB, but where did 1.1TB go? Is this from c8t3d0 being missing? I did move ~2TB of data to mediapool and used Nautilus to delete it. I saw nothing in the Trash can, so I am assuming that it has been deleted. Is this a correct assumption?

BTW, all of the data is in its original state on the WinXP boxes, so if need be I can start all over from scratch with the OpenSolaris installation and RAIDZ1 filesystem. I am not keen on this, as the 2.4TB of data is spread around and it takes forever to copy 2.4TB of data.

Thanks in advance,

David
David,

The disk is broken! Unlike other file systems, which would silently lose your data, ZFS has decided that this particular disk has "persistent errors":

  action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
          ^^^^^^^

It's clear you are unsuccessful at repairing it.

Trevor
David Stewart
2009-Sep-29 20:37 UTC
[zfs-discuss] [ZFS-discuss] RAIDZ drive "removed" status
How do I identify which drive it is? I hear each drive spinning (I listened to them individually) so I can't simply select the one that is not spinning.

David
David,

That depends on the hardware layout. If you don't know, and you say the data is still somewhere else, you could:

Pull a disk out and see what happens to the pool. The one you pulled will be highlighted as the pool loses its redundancy (a 'zpool clear' "should" fix it when you plug it back in).

Or: create a single zpool on each drive, then unplug a drive and see which zpool dies!

However, you may not have hot-plug drives, so if they have a busy light, create a pool on each drive, write a lot of data to each disk's pool one at a time, and see which access lights flash.

Or: export (or destroy) the zpool and power off the machine. Plug in just one drive and boot. Use format to see which drive appeared. Repeat as needed.

You can also run destructive tests using format on the suspect drive and see what that thinks.

It is really a good idea to know which drive is which, because they are going to fail! I'm surprised it's not on the hardware somewhere, but I tend to play with hardware from the big three and there is always a label.

Warning: others have reported that rebooting a system with faulted or degraded ZFS pools can be "problematic" (you :-)), so be careful not to reboot with a pool in that state if at all possible.

Trevor
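A rough sketch of the busy-light variant, in case it helps. This assumes the existing mediapool has already been exported or destroyed and that the disks are c8t0d0 through c8t3d0 as in the earlier zpool output; the pool name "scratch" and the file name are placeholders, not anything from this thread:

  # One disk at a time: make a throwaway pool, write to it while watching
  # the activity lights, then tear it down and move on to the next disk.
  # -f is only there in case the disk still carries an old ZFS label.
  zpool create -f scratch c8t0d0
  dd if=/dev/zero of=/scratch/blinker bs=1024k count=2048
  zpool destroy scratch

Whichever drive's light flashes during the dd is the one currently named c8t0d0; repeat with c8t1d0, c8t2d0, and c8t3d0.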
Marion Hakanson
2009-Sep-29 21:16 UTC
[zfs-discuss] [ZFS-discuss] RAIDZ drive "removed" status
David Stewart wrote:
> How do I identify which drive it is? I hear each drive spinning (I listened
> to them individually) so I can't simply select the one that is not spinning.

You can try reading from each raw device, and looking for a blinky-light to identify which one is active. If you don't have individual lights, you may be able to hear which one is active. The "dd" command should do.

Marion
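A minimal, non-destructive sketch of that approach, assuming the raw devices are /dev/rdsk/c8t0d0s0 through /dev/rdsk/c8t3d0s0 (the device names are taken from David's zpool output; adjust the slice if your labels differ):

  # Read ~1GB from each raw device in turn; watch (or listen for) which
  # drive goes busy on each pass. The drive that never lights up, or that
  # errors out, is the suspect.
  for d in c8t0d0 c8t1d0 c8t2d0 c8t3d0; do
      echo "reading from $d"
      dd if=/dev/rdsk/${d}s0 of=/dev/null bs=1024k count=1024
  done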
David Stewart
2009-Sep-29 21:30 UTC
[zfs-discuss] [ZFS-discuss] RAIDZ drive "removed" status
Before I try these options you outlined I do have a question. I went in to VMWare Fusion and removed one of the drives from the virtual machine that was used to create a RAIDZ pool (there were five drives: one for the OS, and four for the RAIDZ). Instead of receiving the "removed" status that I am getting with the "real" system, I receive "UNAVAIL 0 0 0 cannot open". So, do I really need to remove and RMA the drive, or is it just not being recognized by OpenSolaris, and can I do something nondestructive to find and repair the RAIDZ?

I am so not looking forward to moving the 2.4TB of data around again.

David
Ross Walker
2009-Sep-29 21:38 UTC
[zfs-discuss] [ZFS-discuss] RAIDZ drive "removed" status
On Tue, Sep 29, 2009 at 5:30 PM, David Stewart <despasadena at onebox.com> wrote:
> Before I try these options you outlined I do have a question. I went in to
> VMWare Fusion and removed one of the drives from the virtual machine that
> was used to create a RAIDZ pool (there were five drives, one for the OS,
> and four for the RAIDZ). Instead of receiving the "removed" status that I
> am getting with the "real" system, I receive "UNAVAIL 0 0 0 cannot open".
> So, do I really need to remove and RMA the drive or is it just not being
> recognized by OpenSolaris and can I do something nondestructive to find and
> repair the RAIDZ?

Yes, follow Marion's suggestion: do a

  dd if=/dev/rdsk/c8t3d0s0 of=/dev/null bs=4k count=100000

on the failed drive and look for the activity light to find out which one is faulted, replace it with a good one, and then do a zpool clear.

> I am so not looking forward to moving the 2.4TB of data around again.

That should not be necessary.

-Ross
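A sketch of that replacement sequence, assuming the bad disk really is c8t3d0 and the new drive goes into the same slot so it keeps the same device name (if it shows up under a different name, pass that new name as a second argument to zpool replace):

  zpool replace mediapool c8t3d0    # start resilvering onto the new disk
  zpool status -v mediapool         # watch the resilver progress
  zpool clear mediapool             # reset the error counters once the pool is healthy again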
David Stewart
2009-Sep-29 21:56 UTC
[zfs-discuss] [ZFS-discuss] RAIDZ drive "removed" status
> You can try reading from each raw device, and looking for a blinky-light
> to identify which one is active. If you don't have individual lights,
> you may be able to hear which one is active. The "dd" command should do.

I received an email from another member (Ross) recommending the same solution, and I tested this out on my VMWare machine. I'll give it a try once I am home on the hardware machine.

When the drive went offline, did it reduce the size of the RAIDZ filesystem? The amount of space used and free only adds up to ~2.9TB and not the 4TB that it should.

Once again, thanks,

David
Volker A. Brandt
2009-Sep-29 22:01 UTC
[zfs-discuss] [ZFS-discuss] RAIDZ drive "removed" status
> > How do I identify which drive it is? I hear each drive spinning (I listened
> > to them individually) so I can't simply select the one that is not spinning.
>
> You can try reading from each raw device, and looking for a blinky-light
> to identify which one is active. If you don't have individual lights,
> you may be able to hear which one is active. The "dd" command should do.

Write down the serial numbers on your drives. Then do the following for all "good" drives (the bad one might hang). You can recognize the good ones because format shows the SCSI targets in the initial disk selection prompt.

Procedure:

- Run "format -e" as root.
- Select the first "good" disk.
- Type "scsi" to get into the mode pages menu (don't worry about the warning, you won't do anything to the disks).
- Type "inq" to see the raw inquiry string returned by the disk. Somewhere in there is the serial number as an ASCII string.
- Type "q" to get back to the main menu.
- Type "disk" to select the next disk.

This way, you can match serial numbers and "good" disks. The one left will be the bad one.

HTH -- Volker

--
------------------------------------------------------------------------
Volker A. Brandt                 Consulting and Support for Sun Solaris
Brandt & Brandt Computer GmbH    WWW: http://www.bb-c.de/
Am Wiesenpfad 6, 53340 Meckenheim               Email: vab at bb-c.de
Handelsregister: Amtsgericht Bonn, HRB 10513    Schuhgröße: 45
Geschäftsführer: Rainer J. H. Brandt und Volker A. Brandt
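As a possible shortcut to the same information: on Solaris/OpenSolaris the extended iostat error report usually includes a "Serial No:" field for each disk, so something like the following may list vendor, product, and serial number for all four drives in one go (the exact output depends on the driver, so treat this as a hint rather than a guarantee):

  # Print the extended error/identity report for every disk the system
  # knows about, including the serial number where the driver reports it.
  iostat -En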