Hi,

One of our zfs volumes seems to be having some errors. So I ran zpool
scrub and it's currently showing the following:

-bash-3.2$ pfexec /usr/sbin/zpool status -x
  pool: vdipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress for 3h10m, 13.53% done, 20h16m to go
config:

        NAME         STATE     READ WRITE CKSUM
        vdipool      ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c9t14d0  ONLINE       0     0    12  6K repaired
            c9t15d0  ONLINE       0     0    13  167K repaired
            c9t16d0  ONLINE       0     0    11  5.50K repaired
            c9t17d0  ONLINE       0     0    20  10K repaired
            c9t18d0  ONLINE       0     0    15  7.50K repaired
        spares
          c9t19d0    AVAIL

errors: No known data errors

I have another server connected to the same JBOD using drives c8t1d0 to
c8t13d0 and it doesn't seem to have any errors.

I'm wondering how it could have gotten so screwed up?

Karl
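For reference, a minimal sketch of the commands behind that output, assuming
only the pool name vdipool shown above (standard zpool usage, nothing
site-specific):

   pfexec /usr/sbin/zpool scrub vdipool       # kick off the scrub
   pfexec /usr/sbin/zpool status -x           # show only pools with problems
   pfexec /usr/sbin/zpool status -v vdipool   # full status, plus any files with errors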
Hi Karl...

I just saw this same condition on another list. I think the poster
resolved it by replacing the HBA.

Drives go bad, but they generally don't all go bad at once, so I would
suspect some common denominator like the HBA/controller, cables, and
so on.

See what FMA thinks by running fmdump like this:

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Apr 11 16:02:38.2262 ed0bdffe-3cf9-6f46-f20c-99e2b9a6f1cb ZFS-8000-D3
Apr 11 16:22:23.8401 d4157e2f-c46d-c1e9-c05b-f2d3e57f3893 ZFS-8000-D3
Apr 14 15:55:26.1918 71bd0b08-60c2-e114-e1bc-daa03d7b163f ZFS-8000-D3

This output will tell you when the problem started.

Depending on what fmdump says, which will probably indicate multiple
drive problems, I would run diagnostics on the HBA or get it replaced.

Always have good backups.

Thanks,

Cindy

On 04/15/11 12:52, Karl Rossing wrote:
> One of our zfs volumes seems to be having some errors. So I ran zpool
> scrub and it's currently showing the following.
>
> I have another server connected to the same jbod using drives c8t1d0 to
> c8t13d0 and it doesn't seem to have any errors.
>
> I'm wondering how it could have gotten so screwed up?
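If fmdump shows events like the ZFS-8000-D3 entries above, the lower-level
telemetry usually narrows down whether they came from one device or from the
path the devices share. A sketch of the usual follow-up queries (standard
Solaris FMA commands; the exact output will differ per system):

   # fmdump -e           # error reports (telemetry) with timestamps
   # fmdump -eV | more   # verbose error reports, including the affected device paths
   # fmadm faulty        # resources FMA has actually diagnosed as faulty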
D'oh. One more thing.

We had a problem in b120-123 that caused random checksum errors on
RAIDZ configs. This info is still in the ZFS troubleshooting guide.

See if a zpool clear resolves these errors. If that works, then I would
upgrade to a more recent build and see if the problem is resolved
completely.

If not, then see the recommendation below.

Thanks,

Cindy

On 04/15/11 13:18, Cindy Swearingen wrote:
> I just saw this same condition on another list. I think the poster
> resolved it by replacing the HBA.
>
> Drives go bad but they generally don't all go bad at once, so I would
> suspect some common denominator like the HBA/controller, cables, and
> so on.
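A minimal sketch of the clear-and-rescrub sequence described above, assuming
the pool name vdipool:

   # zpool clear vdipool        # reset the READ/WRITE/CKSUM counters
   # zpool scrub vdipool        # re-verify all data in the pool
   # zpool status -v vdipool    # see whether the counters start climbing again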
I'm going to wait until the scrub is complete before diving in some
more.

I'm wondering if replacing the LSI SAS 3801E with an LSI SAS 9200-8e
might help too.

Karl

On 04/15/2011 02:23 PM, Cindy Swearingen wrote:
> We had a problem in b120-123 that caused random checksum errors on
> RAIDZ configs. This info is still in the ZFS troubleshooting guide.
>
> See if a zpool clear resolves these errors. If that works, then I would
> upgrade to a more recent build and see if the problem is resolved
> completely.
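Before swapping hardware, it may help to confirm which HBA and driver the
pool's disks actually sit behind. A sketch using standard Solaris tools; the
mpt/mpt_sas driver names are typical for these LSI cards but should be
verified on the system itself:

   # prtconf -D | grep -i -e mpt -e scsi   # device tree with bound driver names
   # modinfo | grep -i mpt                 # loaded HBA driver modules and versions
   # grep -i mpt /etc/path_to_inst         # driver instance assignments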
Would moving the pool to a Solaris 10U9 server fix the random RAIDZ
errors?

On 04/15/2011 02:23 PM, Cindy Swearingen wrote:
> We had a problem in b120-123 that caused random checksum errors on
> RAIDZ configs. This info is still in the ZFS troubleshooting guide.
>
> See if a zpool clear resolves these errors. If that works, then I would
> upgrade to a more recent build and see if the problem is resolved
> completely.
Yes, the Solaris 10 9/10 release has the fix for the RAIDZ checksum
errors, if you have ruled out any hardware problems.

cs

On 04/15/11 14:47, Karl Rossing wrote:
> Would moving the pool to a Solaris 10U9 server fix the random RAIDZ
> errors?
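A quick way to confirm what each host is running before moving the pool (note
that if the pool version is upgraded on the newer host, it may no longer
import on the older one):

   # cat /etc/release            # the 9/10 (u9) release string should appear here
   # zpool upgrade -v            # pool versions this host supports
   # zpool get version vdipool   # version the pool is currently at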
Hi Karl,

Is there any chance at all that some other system is writing to the
drives in this pool? You say other things are writing to the same
JBOD...

Given that the amount flagged as corrupt is so small, I'd imagine not,
but thought I'd ask the question anyway.

Cheers!
Nathan.

On 04/16/11 04:52 AM, Karl Rossing wrote:
> I have another server connected to the same jbod using drives c8t1d0
> to c8t13d0 and it doesn't seem to have any errors.
>
> I'm wondering how it could have gotten so screwed up?
> I'm going to wait until the scrub is complete before diving in some
> more.
>
> I'm wondering if replacing the LSI SAS 3801E with an LSI SAS 9200-8e
> might help too.

I've seen similar errors with the 3801; they seem to be SAS timeouts.
Reboot the box and it'll probably work well again for a while. I
replaced the 3801 with a 9200 and the problem was gone.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to avoid
excessive use of idioms of foreign origin. In most cases, adequate and
relevant synonyms exist in Norwegian.
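If the 3801 really is timing out, the evidence usually shows up in the system
log and in the per-device error counters. A sketch of where to look (standard
Solaris locations, nothing specific to this setup):

   # grep -i -e timeout -e retry /var/adm/messages
   # iostat -En | more     # per-device soft/hard/transport error counts
   # fmdump -eV | more     # raw FMA error telemetry, including I/O ereports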
So I figured out, after a couple of scrubs and fmadm faulty, that drive
c9t15d0 was bad. I then replaced the drive using:

-bash-3.2$ pfexec /usr/sbin/zpool offline vdipool c9t15d0
-bash-3.2$ pfexec /usr/sbin/zpool replace vdipool c9t15d0 c9t19d0

The drive resilvered, and I rebooted the server just to make sure
everything was clean. After the reboot, zfs resilvered the same drive
again (which took 7 hours).

My pool now looks like this:

        NAME           STATE     READ WRITE CKSUM
        vdipool        DEGRADED     0     0     2
          raidz1       DEGRADED     0     0     4
            c9t14d0    ONLINE       0     0     1  512 resilvered
            spare      DEGRADED     0     0     0
              c9t15d0  OFFLINE      0     0     0
              c9t19d0  ONLINE       0     0     0  16.1G resilvered
            c9t16d0    ONLINE       0     0     1  512 resilvered
            c9t17d0    ONLINE       0     0     5  2.50K resilvered
            c9t18d0    ONLINE       0     0     1  512 resilvered
        spares
          c9t19d0      INUSE     currently in use

I'm going to replace c9t15d0 with a new drive.

I find it odd that zfs needed to resilver the drive after the reboot.
Shouldn't the resilvered information be kept across reboots?

Thanks,
Karl

On 04/15/2011 03:55 PM, Cindy Swearingen wrote:
> Yes, the Solaris 10 9/10 release has the fix for RAIDZ checksum errors
> if you have ruled out any hardware problems.
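Once the new drive is physically in the c9t15d0 slot, the usual sequence to
fold it in and return c9t19d0 to the spare pool looks roughly like this (a
sketch of standard zpool usage; the device names come from the status output
above, and the spare may detach automatically once the replace completes):

   # zpool replace vdipool c9t15d0   # resilver onto the new drive in the same slot
   # zpool detach vdipool c9t19d0    # return the hot spare to AVAIL if it hasn't detached itself
   # zpool status -v vdipool         # confirm the pool is back to ONLINE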
> I'm going to replace c9t15d0 with a new drive.
>
> I find it odd that zfs needed to resilver the drive after the reboot.
> Shouldn't the resilvered information be kept across reboots?

The iostat data, as returned from iostat -en, are not kept over a
reboot. I don't know if it's possible to keep them in zfs or otherwise.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Karl Rossing
>
> So I figured out, after a couple of scrubs and fmadm faulty, that drive
> c9t15d0 was bad.

Um... call me crazy, but if c9t15d0 was bad, then why do all those
other disks have checksum errors on them?

Although what you said is distinctly possible (a faulty disk behaving so
badly that it causes all the components around it to also exhibit
failures), it seems unlikely. It seems much more likely that a common
component (HBA, RAM, etc.) is faulty, possibly in addition to c9t15d0.

Another possibility is that the faulty HBA (or whatever) caused a false
positive on c9t15d0. Maybe c9t15d0 isn't any more unhealthy than all the
other drives on that bus; they may all be bad, or they may all be good,
including c9t15d0. (It wouldn't be the first time I've seen a whole
batch of disks go bad, from the same manufacturer with closely related
serial numbers and manufacture dates.)

I think you have to explain the checksum errors on all the other disks
before drawing any conclusions. And the fact that it resilvered again
immediately after it resilvered only lends more credence to my
suspicion about the bad-disk diagnosis.

By the way, what OS and what hardware are you running? How long has it
been running, and how much attention do you give it? That is, can you
confidently say it was running without errors for 6 months and then
suddenly started exhibiting this behavior?

If this is in fact a new system, or if you haven't been paying much
attention, I would not be surprised to see this type of behavior if
you're running on unsupported or generic hardware. And when I say
"unsupported" or "generic" I mean that Intel, Asus, Dell, HP, and the
other big name brands all count as "unsupported" and "generic":
basically anything other than Sun hardware and software, fully updated
and still under a support contract, if I'm exaggerating to the extreme.
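One way to check the "same batch of disks" theory, and to see whether the
errors are concentrated on one device or spread across the bus, is the
per-device error and inquiry data (a standard Solaris command, nothing
specific to this pool):

   # iostat -En | more
   # For each disk this prints Soft/Hard/Transport error counts plus Vendor,
   # Product, Revision, and Serial No; closely clustered serial numbers
   # suggest drives from the same manufacturing batch.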
I have an outage tonight and would like to swap out the LSI 3801 for an
LSI 9200.

Should I zpool export before swapping the card?

On 04/16/2011 10:45 AM, Roy Sigurd Karlsbakk wrote:
> I've seen similar errors with the 3801; they seem to be SAS timeouts.
> Reboot the box and it'll probably work well again for a while. I
> replaced the 3801 with a 9200 and the problem was gone.
On May 12, 2011, at 1:53 PM, Karl Rossing wrote:
> I have an outage tonight and would like to swap out the LSI 3801 for
> an LSI 9200.
>
> Should I zpool export before swapping the card?

A clean shutdown is sufficient. You might need to "devfsadm -c disk" to
build the device tree.
 -- richard
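Putting that together, the swap would look roughly like this (a sketch;
shutdown and devfsadm usage is standard Solaris, and a non-exported pool is
normally re-imported automatically at boot):

   # shutdown -y -g0 -i5       # clean shutdown and power off before pulling the card
   #   ...swap the LSI 3801 for the LSI 9200, then boot...
   # devfsadm -c disk          # rebuild /dev disk links if the new paths are missing
   # zpool status vdipool      # confirm the pool came back online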